
llms-from-scratch's Introduction

Build a Large Language Model (From Scratch)

This repository contains the code for developing, pretraining, and finetuning a GPT-like LLM and is the official code repository for the book Build a Large Language Model (From Scratch).

(If you downloaded the code bundle from the Manning website, please consider visiting the official code repository on GitHub at https://github.com/rasbt/LLMs-from-scratch.)




In Build a Large Language Model (From Scratch), you'll learn and understand how large language models (LLMs) work from the inside out by coding them from the ground up, step by step. In this book, I'll guide you through creating your own LLM, explaining each stage with clear text, diagrams, and examples.

The method described in this book for training and developing your own small-but-functional model for educational purposes mirrors the approach used in creating large-scale foundational models such as those behind ChatGPT.



Table of Contents

Please note that this README.md file is a Markdown (.md) file. If you have downloaded this code bundle from the Manning website and are viewing it on your local computer, I recommend using a Markdown editor or previewer for proper viewing. If you haven't installed a Markdown editor yet, MarkText is a good free option.

Alternatively, you can view this and other files on GitHub at https://github.com/rasbt/LLMs-from-scratch.



Tip

If you're seeking guidance on installing Python and Python packages and setting up your code environment, I suggest reading the README.md file located in the setup directory.




Chapter Title | Main Code (for quick access) | All Code + Supplementary
--- | --- | ---
Setup recommendations | - | -
Ch 1: Understanding Large Language Models | No code | -
Ch 2: Working with Text Data | ch02.ipynb, dataloader.ipynb (summary), exercise-solutions.ipynb | ./ch02
Ch 3: Coding Attention Mechanisms | ch03.ipynb, multihead-attention.ipynb (summary), exercise-solutions.ipynb | ./ch03
Ch 4: Implementing a GPT Model from Scratch | ch04.ipynb, gpt.py (summary), exercise-solutions.ipynb | ./ch04
Ch 5: Pretraining on Unlabeled Data | ch05.ipynb, gpt_train.py (summary), gpt_generate.py (summary), exercise-solutions.ipynb | ./ch05
Ch 6: Finetuning for Text Classification | ch06.ipynb, gpt_class_finetune.py, exercise-solutions.ipynb | ./ch06
Ch 7: Finetuning to Follow Instructions | ch07.ipynb, gpt_instruction_finetuning.py, ollama_evaluate.py, exercise-solutions.ipynb | ./ch07
Appendix A: Introduction to PyTorch | code-part1.ipynb, code-part2.ipynb, DDP-script.py, exercise-solutions.ipynb | ./appendix-A
Appendix B: References and Further Reading | No code | -
Appendix C: Exercise Solutions | No code | -
Appendix D: Adding Bells and Whistles to the Training Loop | appendix-D.ipynb | ./appendix-D
Appendix E: Parameter-efficient Finetuning with LoRA | appendix-E.ipynb | ./appendix-E

 

Shown below is a mental model summarizing the contents covered in this book.


 

Hardware Requirements

The code in the main chapters of this book is designed to run on conventional laptops within a reasonable timeframe and does not require specialized hardware. This approach ensures that a wide audience can engage with the material. Additionally, the code automatically utilizes GPUs if they are available.
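For example, the device-selection pattern typically looks like the following minimal sketch (not the book's exact code):

import torch

# Pick a GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
# The chapter code then moves the model and batches there, e.g. model.to(device).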

 

Bonus Material

Several folders contain optional bonus materials for interested readers.


 

Citation

If you find this book or code useful for your research, please consider citing it:

@book{build-llms-from-scratch-book,
  author       = {Sebastian Raschka},
  title        = {Build A Large Language Model (From Scratch)},
  publisher    = {Manning},
  year         = {2024},
  isbn         = {978-1633437166},
  url          = {https://www.manning.com/books/build-a-large-language-model-from-scratch},
  github       = {https://github.com/rasbt/LLMs-from-scratch}
}

llms-from-scratch's People

Contributors

d-kleine, debnsuma, ehberg, eltociear, hammer, intelligence-manifesto, jameslholcombe, jingedawang, joel-foo, kuutsav, pitmonticone, rasbt, rayed-therap, shenxiangzhuang, shuyib, speed1313, taihaozesong, xiaotian0328


llms-from-scratch's Issues

Difference btwn book and repo

Hi @rasbt - very much enjoying your book! Just a heads up about a difference I found between the book and the repo. Both result in the same value, and the code in the repo is what I expected. Screenshot attached. I think d_k = keys.shape[1].
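(As a quick sketch of why both versions give the same value, assuming the 2-D keys tensor from the chapter's single-example code: index 1 is the last dimension, so keys.shape[1] and keys.shape[-1] coincide.)

import torch

keys = torch.randn(6, 2)                 # (num_tokens, d_out); toy shape for illustration
print(keys.shape[1] == keys.shape[-1])   # True for a 2-D tensor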

RuntimeError: size mismatch - ch05/03_bonus_pretraining_on_gutenberg

I have an issue running pretraining_simple.py. I have downloaded ca. 50% of the files from Project Gutenberg via the gutenberg repo and then ran your scripts:

The text data preparation works fine so far:

prepare_dataset.py

root@9db1a84319a3:/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg# python prepare_dataset.py
--data_dir gutenberg/data
--max_size_mb 500
--output_dir gutenberg_preprocessed
16697 file(s) to process.

But when trying to train the model, there is a shape mismatch. It seems like the data is not being processed batch-wise:

pretraining_simple.py

root@9db1a84319a3:/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg# python pretraining_simple.py --data_dir "gutenberg_preprocessed" --n_epochs 1 --batch_size 4 --output_dir model_checkpoints
Total files: 16
Tokenizing file 1 of 16: gutenberg_preprocessed/combined_1.txt
Training ...
Traceback (most recent call last):
File "/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py", line 200, in
train_losses, val_losses, tokens_seen = train_model_simple(
File "/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg/pretraining_simple.py", line 110, in train_model_simple
loss = calc_loss_batch(input_batch, target_batch, model, device)
File "/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg/previous_chapters.py", line 247, in calc_loss_batch
loss = torch.nn.functional.cross_entropy(logits.flatten(0, -1), target_batch.flatten())
File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: size mismatch (got input: [205852672], target: [4096])

I believe the issue comes from the flatten() call. In calc_loss_batch() in previous_chapters.py, what do you think about replacing flatten() with view()?

loss = torch.nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), target_batch.view(-1))

Please double-check whether this idea and the output are correct.
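As a quick sanity check, here is a sketch with toy shapes (not the chapter's actual dimensions) showing why flatten(0, -1) breaks cross_entropy while the view() variant works:

import torch

# Toy shapes only: (batch, num_tokens, vocab_size)
logits = torch.randn(4, 8, 100)
targets = torch.randint(0, 100, (4, 8))

print(logits.flatten(0, -1).shape)             # 1-D tensor of size 4*8*100 -> size mismatch with targets
print(logits.view(-1, logits.size(-1)).shape)  # (32, 100), which matches targets.view(-1)
loss = torch.nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
print(loss)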


I have run the updated script locally on my RTX 3080 Ti; the output is:

root@9db1a84319a3:/workspaces/LLMs-from-scratch/ch05/03_bonus_pretraining_on_gutenberg# python pretraining_simple.py --data_dir "gutenberg_preprocessed" --n_epochs 1 --batch_size 4 --output_dir model_checkpoints
Total files: 16
Tokenizing file 1 of 16: gutenberg_preprocessed/combined_1.txt
Training ...
Ep 1 (Step 0): Train loss 9.952, Val loss 9.663
Every effort moves you
Ep 1 (Step 100): Train loss 6.567, Val loss 6.906
Ep 1 (Step 200): Train loss 6.468, Val loss 6.637
Ep 1 (Step 300): Train loss 6.170, Val loss 6.578
Ep 1 (Step 400): Train loss 5.560, Val loss 6.485
Ep 1 (Step 500): Train loss 5.874, Val loss 6.381
Ep 1 (Step 600): Train loss 5.481, Val loss 6.449
Ep 1 (Step 700): Train loss 5.620, Val loss 6.314
...

Throwing error for longer textual data like 9599

train loader:
Input batch dimensions: torch.Size([8, 9599])
Label batch dimensions torch.Size([8])


IndexError Traceback (most recent call last)
Cell In[32], line 6
2 model.to(device) # no assignment model = model.to(device) necessary for nn.Module classes
4 torch.manual_seed(123) # For reproducibility due to the shuffling in the training data loader
----> 6 train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)
7 val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
8 test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)

Cell In[31], line 14, in calc_accuracy_loader(data_loader, model, device, num_batches)
11 input_batch, target_batch = input_batch.to(device), target_batch.to(device)
13 with torch.no_grad():
---> 14 logits = model(input_batch)[:, -1, :] # Logits of last output token
15 predicted_labels = torch.argmax(logits, dim=-1)
17 num_examples += predicted_labels.shape[0]

File ~\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)

File ~\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None

File ~\OneDrive - Dell Technologies\Documents\DSS\LLM\Classification\previous_chapters.py:208, in GPTModel.forward(self, in_idx)
206 batch_size, seq_len = in_idx.shape
207 tok_embeds = self.tok_emb(in_idx)
--> 208 pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
209 x = tok_embeds + pos_embeds # Shape [batch_size, num_tokens, emb_size]
210 x = self.drop_emb(x)

File ~\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)

File ~\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None

File ~\AppData\Roaming\Python\Python311\site-packages\torch\nn\modules\sparse.py:162, in Embedding.forward(self, input)
161 def forward(self, input: Tensor) -> Tensor:
--> 162 return F.embedding(
163 input, self.weight, self.padding_idx, self.max_norm,
164 self.norm_type, self.scale_grad_by_freq, self.sparse)

File ~\AppData\Roaming\Python\Python311\site-packages\torch\nn\functional.py:2233, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2227 # Note [embedding_renorm set_grad_enabled]
2228 # XXX: equivalent to
2229 # with torch.no_grad():
2230 # torch.embedding_renorm_
2231 # remove once script supports set_grad_enabled
2232 no_grad_embedding_renorm(weight, input, max_norm, norm_type)
-> 2233 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self
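(For context, the traceback points at the positional-embedding lookup: the input of length 9599 exceeds the model's supported context length, so torch.arange(seq_len) indexes past the positional-embedding table. A possible workaround, sketched below under the assumption that the configured context length is available, e.g. as BASE_CONFIG["context_length"] from the chapter code, is to truncate over-long inputs before the forward pass.)

# Sketch only; "model", "input_batch", and BASE_CONFIG["context_length"] are
# assumed to come from the chapter 6 code.
max_len = BASE_CONFIG["context_length"]
input_batch = input_batch[:, :max_len]  # keep at most context_length tokens per sequence
with torch.no_grad():
    logits = model(input_batch)[:, -1, :]  # logits of the last output token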

Contributions for Chinese simplified version

hi, @rasbt~
This project is awesome and the tutorial structure is rather clear, I was able to get up and running quickly and I'm learning a lot from it. Really appreciate your work! Would you be interested in having a Chinese version of your project? So that LLM learners from China can refer to your work more efficiently. Maybe I can begin with README-zh.md?

Wrong number of token ids specified in the notebook (2.7 Creating token embeddings)

Hi @rasbt,

There is the following description in this section:

Previously, we have seen how to convert a single token ID into a three-dimensional
embedding vector. Let's now apply that to all four input IDs we defined earlier (torch.tensor([5, 1, 3, 2])):

But probably there is a typo in the notebook and you specified only 3 tokens for the same code (after cell [47]):

To embed all three input_ids values above, we do

Thank you.

Incorrect code output in the book (2.2 Tokenizing text)

Hi @rasbt,

I found that in the latest book version (v5) there is an incorrect code output in the section "2.2 Tokenizing text":

result = re.split(r'([,.]|\s)', text)
print(result)

We can see that the words and punctuation characters are now separate list entries just
as we wanted:

['Hello', ',', '', ' ', 'world.', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test.']

and

The resulting whitespace-free output looks like as follows:

['Hello', ',', 'world.', 'This', ',', 'is', 'a', 'test.']

But if we execute provided notebook, the output is correct.

P.S. It is a great pleasure to explore your next new book, especially about LLMs, thank you! :)

Thank you.

Expected all tensors to be on the same device

There appears to be an issue when running the code from chapter 6 (other sections not tested):

Error

Traceback (most recent call last):
  File "/home/user/workspace/project/llm/tune_incl.py", line 359, in <module>
    train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/project/llm/tune_incl.py", line 155, in train_classifier_simple
    loss = calc_loss_batch(input_batch, target_batch, model, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/project/llm/tune_incl.py", line 112, in calc_loss_batch
    logits = model(input_batch)[:, -1, :]  # Logits of last output token
             ^^^^^^^^^^^^^^^^^^
  File "/home/user/.venvs/main/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.venvs/main/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/workspace/project/llm/util.py", line 173, in forward
    return logits
             ^^^^^
  File "/home/user/.venvs/main/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.venvs/main/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.venvs/main/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

Cause

I narrowed it down to this line:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/gpt-class-finetune.py#L398

This replaces the output layer after the model has been moved to the GPU, which later triggers this error.

Solution

This issue can be mitigated by adding this just after that statement:

[...]
num_classes = 2
model.out_head = torch.nn.Linear(in_features=BASE_CONFIG["emb_dim"], out_features=num_classes)
# add this to move all model parameters to GPU
model = model.to(device)
[...]

Question about number of tokens in ChatGPT (2.5 Byte pair encoding)

Hi @rasbt,

Could you please clarify this sentence:

In fact, the BPE tokenizer that was used to train models such as GPT-2, GPT-3,
and ChatGPT has a total vocabulary size of 50,257, with <|endoftext|> being assigned
the largest token ID.

Which model do you mean by 'ChatGPT'?
I saw different definitions of this term, and based on these definitions there are different vocabulary sizes.
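(For reference, a quick check with tiktoken, assuming the gpt2 encoding used throughout the chapter, shows both the vocabulary size and the <|endoftext|> token ID:)

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)  # 50257
print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # [50256]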

Thank you.

Chapter 5 - Context Size and the DataLoaders

First off, great book!

Second, I noticed a small issue in Section 5.1.1 that stumped me for a bit.

"ctx_len": 256, # Shortened context length (orig: 1024)

If this is set to 1024, the val_loader will fail to load with the train_ratio of 0.90. Adjusting to 0.80 will load the data but the shape is mismatched.

Restoring the ctx_len to 256 fixes the issue.

I'm curious as to why this is occurring?
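(A possible explanation, sketched below as a rough back-of-the-envelope check rather than a definitive answer: with a non-overlapping sliding window, the number of windows in a split depends on how many tokens the split holds, and a small validation split can contain fewer tokens than the context length, leaving the val_loader empty.)

# Rough sketch: approximate number of sliding windows when stride == context length.
def num_windows(n_tokens, ctx_len, stride):
    return max(0, (n_tokens - ctx_len) // stride + 1)

# Hypothetical token counts for a 10% validation split of a short text:
print(num_windows(515, 1024, 1024))  # 0 -> no full window fits, the loader ends up empty
print(num_windows(515, 256, 256))    # 2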

Error in the code in Listing A.13 (DDP-script.py)

Hi @rasbt,

I tried to run your DDP script and found that there is an error while executing this script "as-is":

PyTorch version: 2.2.1+cu121
CUDA available: True
Number of GPUs available: 2
Traceback (most recent call last):
  File "/home/user/app/DDP-script.py", line 178, in <module>
    mp.spawn(main, args=(world_size, num_epochs), nprocs=world_size)
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/user/miniconda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/user/app/DDP-script.py", line 128, in main
    features, labels = features.to(rank), labels.to(rank) # New: use rank
AttributeError: 'int' object has no attribute 'to'

The reason is the following incorrect line:

for features, labels in enumerate(train_loader):

which should be like that:

for idx, (features, labels) in enumerate(train_loader):

or like that (because idx was not used):

for features, labels in train_loader:

Thank you.

Inconsistencies between the code in the book and the notebooks (2.6 Data sampling with a sliding window)

Hi @rasbt,

I noticed that in the book you provide the following code with function name create_dataloader and the argument stride = max_length + 1 to avoid overlap in data even for targets:

dataloader = create_dataloader(raw_text, batch_size=8, max_length=4, stride=5)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

But in the cell of the jupyter notebook with main code (cell [43]) and jupyter notebook with only dataloader (cell [2]) you use function with name create_dataloader_v1 and argument stride = max_length.

Could you please tell me whether I understand correctly that we need to use stride = max_length + 1 to avoid overfitting? Does the overlap in targets (when stride = max_length) seriously increase the risk of overfitting?
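For illustration, here is a minimal sketch (toy token IDs, not the book's data) of how the two stride choices affect the overlap between consecutive windows:

token_ids = list(range(12))
max_length = 4

for stride in (max_length, max_length + 1):
    print(f"stride={stride}")
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i:i + max_length]
        targets = token_ids[i + 1:i + max_length + 1]
        print("  inputs:", inputs, " targets:", targets)
# With stride=max_length, the last target of one window is the first input of the next;
# with stride=max_length+1, inputs and targets of consecutive windows do not overlap at all.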

Thank you.

ch06/03_bonus_imdb-classification

This might be still WIP, but I have issues reproducing the output in ch06/03_bonus_imdb-classification:

  1. The scripts gpt_download.py and previous_chapters.py are missing from the folder, therefore I cannot run python train-gpt.py as instructed in the README
  2. It seems like python download-prepare-dataset.py does not correctly create the test and validation set (train set seems to be fine though):
  • When copying the files from ch06/02_bonus_additional-experiments to ch06/03_bonus_imdb-classification, running python train-gpt.py results in a val loss of NaNs:
root@2f7823635ae1:/workspaces/LLMs-from-scratch/ch06/03_bonus_imdb-classification# python train-gpt.py
2024-05-14 12:55:48.662928: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-05-14 12:55:48.689449: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-14 12:55:49.247512: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
File already exists and is up-to-date: gpt2/124M/checkpoint
File already exists and is up-to-date: gpt2/124M/encoder.json
File already exists and is up-to-date: gpt2/124M/hparams.json
File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/124M/model.ckpt.index
File already exists and is up-to-date: gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: gpt2/124M/vocab.bpe
2024-05-14 12:55:57.722961: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 154389504 exceeds 10% of free system memory.
Ep 1 (Step 000000): Train loss 6.594, Val loss nan
Ep 1 (Step 000050): Train loss 2.141, Val loss nan
Ep 1 (Step 000100): Train loss 0.590, Val loss nan
Ep 1 (Step 000150): Train loss 0.107, Val loss nan
Ep 1 (Step 000200): Train loss 0.042, Val loss nan
Ep 1 (Step 000250): Train loss 0.030, Val loss nan
Ep 1 (Step 000300): Train loss 0.019, Val loss nan
Ep 1 (Step 000350): Train loss 0.011, Val loss nan
  • Similar issues with the test and validation set also occur when running python train-bert-hf.py and python train-sklearn-logreg.py
    -> instead of val.csv it should be validation.csv in train-sklearn-logreg.py (as defined in download-prepare-dataset.py)

suggestion of adding torch.profile

I just checked out the code of appendix-A/01_main-chapter-code/DDP-script.py. How about adding

from torch.profiler import profile
with profile() as prof:
    #the main function training code
if rank == 0:
    print("exporting trace")
    prof.export_chrome_trace("trace_ddp_simple.json")

Then we can view the tracing profile JSON file in Google Chrome.

Question about implementation of CausalAttention class (3.5.3 Implementing a compact causal self-attention class)

Hi @rasbt,

This notebook contains the following implementation of CausalAttention:

class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, block_size, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New
        self.register_buffer('mask', torch.triu(torch.ones(block_size, block_size), diagonal=1)) # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights) # New

        context_vec = attn_weights @ values
        return context_vec

I have a question - why do we need the following 2 lines in the forward() method implementation:

def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        ...
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        ...

Can we remove the first line and just replace the second line to the following code:

attn_scores.masked_fill_(self.mask.bool(), -torch.inf)

As I understand num_tokens = batch_size and we provide batch_size value as the argument, so neither calculating x.shape nor indexing [:num_tokens, :num_tokens] is required.
Is it correct?
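(A quick sketch of the shape consideration behind that slicing, with toy sizes and not presented as a definitive answer: in the quoted code, num_tokens is the sequence length of x, which can be smaller than block_size, so attn_scores has shape (b, num_tokens, num_tokens) and the full (block_size, block_size) mask would not line up.)

import torch

block_size, num_tokens = 6, 4
mask = torch.triu(torch.ones(block_size, block_size), diagonal=1)
attn_scores = torch.randn(2, num_tokens, num_tokens)  # (b, num_tokens, num_tokens)

attn_scores.masked_fill_(mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # works
# attn_scores.masked_fill_(mask.bool(), -torch.inf)  # would fail: (6, 6) does not broadcast to (2, 4, 4)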

Thank you.

Inconsistencies in output for dropout section (3.5.2 Masking additional attention weights with dropout)

Hi @rasbt,

I am trying to explore and reproduce Chapter 3 and found that I can't reproduce the results that you show in the notebook and the book, even if I download the notebook and run it without any changes.
The difference appears only starting with the following 2 cells (I haven't checked the next cells yet):

Cell [31]

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5) # dropout rate of 50%
example = torch.ones(6, 6) # create a matrix of ones

print(dropout(example))

Your output

tensor([[2., 2., 0., 2., 2., 0.],
        [0., 0., 0., 2., 0., 2.],
        [2., 2., 2., 2., 0., 2.],
        [0., 2., 2., 0., 0., 2.],
        [0., 2., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.]])

My output

tensor([[2., 2., 2., 2., 2., 2.],
        [0., 2., 0., 0., 0., 0.],
        [0., 0., 2., 0., 2., 0.],
        [2., 2., 0., 0., 0., 2.],
        [2., 0., 0., 0., 0., 2.],
        [0., 2., 0., 0., 0., 0.]])

Cell [32]

torch.manual_seed(123)
print(dropout(attn_weights))

Your output

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7599, 0.6194, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4921, 0.4925, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3966, 0.0000, 0.3775, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.3331, 0.3084, 0.3331, 0.0000]],
       grad_fn=<MulBackward0>)

My output

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.8966, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.6206, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4921, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4350, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.3327, 0.0000, 0.0000, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)

Thank you.

tiktoken is not running in jupyter notebook

Hello rasbt,
Nice to meet you! I've been enjoying your book so far (LLMs from Scratch), but I find the examples hard to follow, as the book does not mention which versions of some of the tools you used. I tried to follow along, but packages like tiktoken and PyTorch refuse to work or even get installed. I tried using conda to create environments with both Python 3.9 and 3.10, and both successfully install tiktoken but fail to import it in the Jupyter notebook. The command I ran to attempt the installation was pip install tiktoken.

Can you let me know which versions of Python / tiktoken / PyTorch you were using? Is there any intermediate step I missed?

I am running Windows 11 and a non-Nvidia GPU.

load_weights_into_gpt getting error (Ch 5)

https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb

import numpy as np

def load_weights_into_gpt(gpt, params):
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])
    
    for b in range(len(params["blocks"])):
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T)
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T)
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.bias = assign(
            gpt.trf_blocks[b].att.W_query.bias, q_b)
        gpt.trf_blocks[b].att.W_key.bias = assign(
            gpt.trf_blocks[b].att.W_key.bias, k_b)
        gpt.trf_blocks[b].att.W_value.bias = assign(
            gpt.trf_blocks[b].att.W_value.bias, v_b)

        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight, 
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias, 
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight, 
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias, 
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight, 
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias, 
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale, 
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift, 
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale, 
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift, 
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])
    
    
load_weights_into_gpt(gpt, params)
gpt.to(device);

Error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[232], line 64
     60     gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
     61     gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])
---> 64 load_weights_into_gpt(gpt, params)
     65 gpt.to(device)

Cell In[232], line 4
      3 def load_weights_into_gpt(gpt, params):
----> 4     gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])
      5     gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])
      7     for b in range(len(params["blocks"])):

TypeError: 'ellipsis' object is not subscriptable

Possible typos in Chapter 2 (Working with text data)

Bug description

2.3 Converting tokens into token IDs
After determining that the vocabulary size is 1,159 via this code, we create the vocabulary and print its first 51 entries for illustration purposes.

2.4 Adding special context tokens
Based on the output of this print statement, the new vocabulary size is 1,161 (the vocabulary size in the previous section was 1,159).

I fetched the latest repo and ran the notebook, and got 1,130 as my vocab size instead.

Fig 2.17
"..., the token ID 5, whether it's in the first or third position in the token ID input vector, ..."

I'm not sure if I'm misunderstanding the picture. Shouldn't it be "..., the token ID 2, whether it's in the first or fourth position in the token ID input vector, ..."?

What operating system are you using?

Windows

Where do you run your code?

Local (laptop, desktop)

Environment

[OK] Your Python version is 3.11.6
2024-06-18 22:09:30.458860: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-18 22:09:31.751283: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[OK] torch 2.3.1+cpu
[OK] jupyterlab 4.2.2
[OK] tiktoken 0.7.0
[OK] matplotlib 3.9.0
[OK] numpy 1.26.4
[OK] tensorflow 2.16.1
[OK] tqdm 4.66.4
[OK] pandas 2.2.2
[OK] psutil 5.9.8


chapter 5 Exercise 5.4: Continued pretraining optimizer not on the same device

Bug description

When I load an optimizer from a serialized PyTorch state dictionary file via the following code:

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

and then continue pretraining the GPT model, an error about torch tensors being on different devices emerged, so I implemented a function that moves the optimizer state from CPU to GPU, as below:

def optimizer_to(optim, device):
    for param in optim.state.values():
        # Not sure there are any global tensors in the state dict
        if isinstance(param, torch.Tensor):
            param.data = param.data.to(device)
            if param._grad is not None:
                param._grad.data = param._grad.data.to(device)
        elif isinstance(param, dict):
            for subparam in param.values():
                if isinstance(subparam, torch.Tensor):
                    subparam.data = subparam.data.to(device)
                    if subparam._grad is not None:
                        subparam._grad.data = subparam._grad.data.to(device)

afterwards

# Send to GPU
optimizer_to(optimizer,device)

the model training then proceeds successfully. I've checked the PyTorch documentation and did not find any solution for moving optimizer parameter tensors between devices; maybe there is a better way.
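(A possibly simpler alternative, sketched below; the checkpoint file name and dictionary keys are assumptions based on the snippet above and are not verified against the chapter's exact code. Loading the checkpoint with map_location places the saved tensors on the target device from the start.)

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# "model" and the checkpoint contents are assumed to come from the chapter 5 code;
# "model_and_optimizer.pth" is a placeholder file name.
checkpoint = torch.load("model_and_optimizer.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])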

What operating system are you using?

None

Where do you run your code?

None

Environment




Offering Chinese Translation for 'Build a Large Language Model From Scratch

Dear Dr. Sebastian Raschka,

Greetings! I am a researcher passionate about machine learning and artificial intelligence. As a native Chinese speaker, I would like to extend my deepest respect and gratitude for the open-source repository of "Build a Large Language Model From Scratch" that you have made available on GitHub. This book is not only comprehensive and beautifully illustrated but also organized in such a manner that beginners like myself find it both intuitive and easy to understand. Your work showcases profound expertise while being incredibly accessible to newcomers, from which I have greatly benefited.

Above all, I am inspired by your passion for AI and open-source software. Motivated by this passion, I have embarked on a project to translate your book and its associated code into Chinese. This effort aims to assist Chinese-speaking learners, like me, in better understanding the process of building large language models. To date, I have completed the translation of the first four chapters. During this process, I have made a concerted effort to clarify any contextual differences and added some foundational knowledge to help beginners grasp the material more effectively.

I am eager to contribute my translated version to the project and wonder if it would be possible to do so by including a link to my forked version in the official GitHub repository's readme or through another method you deem appropriate. My forked version is located at Intelligence-Manifesto/LLMs-from-scratch, which contains the translation work completed so far.

With this letter, I wish to express not only my admiration and thanks for this invaluable book but also seek your guidance and assistance on how I might integrate my work into this admirable open-source project in a suitable manner. How might I contribute my translation so that more Chinese readers can benefit?

Thank you again for your outstanding work and contributions to the open-source community. I look forward to your response.

Sincerely,
Intelligence-Manifesto

Unable to reproduce notebook (Evaluating Instruction Responses Locally Using a Llama 3 Model Via Ollama)

Hello! Bought the book and am enjoying it so far. I came across this notebook which I'm trying to reproduce but with little success.

I have followed the instructions:

  • Installed ollama for windows
  • Able to run & chat with Llama3 in Windows Powershell

However, when I run this chunk of code:

import urllib.request
import json

def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "seed":123,        # for deterministic responses
        "temperature":0,   # for deterministic responses
        "messages": [
            {"role": "user", "content": prompt}
        ]
    }

    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data


result = query_model("What do Llamas eat?")
print(result)

I'm getting this error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
[/usr/lib/python3.10/urllib/request.py](https://localhost:8080/#) in do_open(self, http_class, req, **http_conn_args)
   1347             try:
-> 1348                 h.request(req.get_method(), req.selector, req.data, headers,
   1349                           encode_chunked=req.has_header('Transfer-encoding'))

15 frames
OSError: [Errno 99] Cannot assign requested address

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
[/usr/lib/python3.10/urllib/request.py](https://localhost:8080/#) in do_open(self, http_class, req, **http_conn_args)
   1349                           encode_chunked=req.has_header('Transfer-encoding'))
   1350             except OSError as err: # timeout error
-> 1351                 raise URLError(err)
   1352             r = h.getresponse()
   1353         except:

URLError: <urlopen error [Errno 99] Cannot assign requested address>

Running http://localhost:11434/api in my browser gives a "404 page not found" error, but running http://localhost:11434/ gives me a notification that "Ollama is running".

Was wondering if you've encountered this error, or would know how to solve this? Thanks!

Will this book talk about RLHF?

Great book! I read all the notebooks in this repo and here is a question.

I heard RLHF (Reinforcement Learning from Human Feedback) is the core technique behind ChatGPT. Does this book talk about it?

I see there will be extra material about DPO for preference fine-tuning. Is it equivalent to RLHF? What's the popular practice in the current industry after instruction finetuning?

Thanks!

GPT-2 architecture

I might have found a slight mistake in the visualization of the transformer architecture. It's illustrated with pre-LN:

That transformer model is also used in some figures, in ch06/01_main-chapter-code/ch06.ipynb, section "6.5 Adding a classification head" there is a visualization of the GPT-2 architecture:

When using Pre-LN, shouldn't the residual connection be after that "LN 2" step?


book feedback

Hi @rasbt: fantastic work, and code which is clean and readable.

One small piece of feedback/issue I noticed with the early access book is that in chapter 3, the manual seed of 789 is missing, which is what brought me here :)

Feedback: Strip output from notebooks

This book is a wonderful read; I just wanted to submit one small comment on the notebooks, which could just be personal learning style. It's nice to have to run the actual notebook to get the output, so block by block it's easier to focus on that without being distracted by the output already rendered. So maybe there could be 2 notebooks per chapter, a clean one and a completed one? In the meantime I'm just using nbstripout locally, but wanted to pass along the feedback.

Several package requirements from bonus material are not specified in requirements.txt (Tokenizers comparison)

Hi @rasbt,

I don't know if packages from the notebooks with bonus materials, like this notebook with the tokenizers comparison, are intended to be included in requirements.txt, but there are 2 missing libraries:

  • tqdm (which is required by import from bpe_openai_gpt2 import get_encoder, download_vocab)
  • transformers

To simplify managing the libraries used for this project, I use poetry, which is great for tracking all explicit and implicit dependencies, so if you want I can send you my configuration for it.

Thank you.

Output of the cell without variable specified (Embedding Layers and Linear Layers)

Hi @rasbt,

There is a cell [28] in this notebook where there is an output but no variable to output is specified (probably it was linear.weight which was deleted after cell execution):

torch.manual_seed(123)
linear = torch.nn.Linear(num_idx, out_dim, bias=False)
---
Parameter containing:
tensor([[-0.2039,  0.0166, -0.2483,  0.1886],
        [-0.4260,  0.3665, -0.3634, -0.3975],
        [-0.3159,  0.2264, -0.1847,  0.1871],
        [-0.4244, -0.3034, -0.1836, -0.0983],
        [-0.3814,  0.3274, -0.1179,  0.1605]], requires_grad=True)

Thank you.

Encoding/decoding transformation of the text (2.3 Converting tokens into token IDs)

Hi @rasbt,

I noticed that when we decode the following encoded sentence:

"It's the last he painted, you know," Mrs. Gisburn said with
pardonable pride.

We will have additional leading spaces at the start of the sentence and after the apostrophe in the word It' s:

 "It' s the last he painted, you know," Mrs. Gisburn said with
pardonable pride.

Formally, this does not matter in our case, because we do not take spaces into account, but in general we do not precisely restore the original text here, right?

Could you please tell me whether you are interested in such minor feedback, or whether it is not worth notes or new issues?

Thank you.

The definition of stride is confusing in 2.6

Hi @rasbt, what an amazing job. But the definition of stride confuses me, as follows:

We use a sliding window approach where we slide the window one word at a time (this is also known as stride=1):
[figure 1]

An example using stride equal to the context length (here: 4) as shown below:
[figure 2]

I think stride is the separation distance between two inputs. In fig 1, the distance between the two inputs is actually four words. But here, stride marks the distance between the input and the target.

Incorrect description of function torch.arange() (2.8 Encoding word positions)

Hi @rasbt,

There is a probably typo in the description of torch.arange() function here:

As shown in the preceding code example, the input to the pos_embeddings is usually a
placeholder vector torch.arange(block_size), which contains a sequence of numbers
1, 2, ..., up to the maximum input length.

I think you mean the range 0, 1, ..., up to the maximum input length - 1?
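(For reference, a quick check:)

import torch

block_size = 4
print(torch.arange(block_size))  # tensor([0, 1, 2, 3]), i.e., 0 up to block_size - 1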

Thank you.

Missing encoder.json and vocab.bpe for running bpe_openai_gpt2 (02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb)

A FileNotFoundError occurred when trying to instantiate the bpe_openai_gpt2 encoder as follows:

--------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[20], line 1
----> 1 orig_tokenizer = get_encoder(model_name="gpt2", models_dir=".")

File ~/localdev/python/LLMs-from-scratch/ch02/02_bonus_bytepair-encoder/bpe_openai_gpt2.py:140, in get_encoder(model_name, models_dir)
    139 def get_encoder(model_name, models_dir):
--> 140     with open(os.path.join(models_dir, model_name, 'encoder.json'), 'r') as f:
    141         encoder = json.load(f)
    142     with open(os.path.join(models_dir, model_name, 'vocab.bpe'), 'r', encoding="utf-8") as f:

FileNotFoundError: [Errno 2] No such file or directory: './gpt2/encoder.json'
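(A guess at the likely fix, sketched below: the notebook also imports a download_vocab helper alongside get_encoder, as noted in an issue above, and the assumption here is that calling it first fetches encoder.json and vocab.bpe into a local gpt2/ folder.)

# Sketch only; assumes download_vocab() downloads the GPT-2 vocab files
# (encoder.json, vocab.bpe) before get_encoder() tries to open them.
from bpe_openai_gpt2 import get_encoder, download_vocab

download_vocab()
orig_tokenizer = get_encoder(model_name="gpt2", models_dir=".")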

Incorrect diagram: chapter 2 under "This chapter covers attention mechanisms, the engine of LLMs"

stride value caused skipping one word

"dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=5, shuffle=False)\n",
This code does skip one word, which is different to the text in the book saying we do not skip a word and do not overlap. stride=4 make it consistent with the book.

class MHAPyTorchScaledDotProduct

Thanks for the great work. I have several questions about class MHAPyTorchScaledDotProduct in mha-implementations.ipynb:

class MHAPyTorchScaledDotProduct(nn.Module):
    def __init__(self, d_in, d_out, num_heads, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()

        assert d_out % num_heads == 0, "embed_dim is indivisible by num_heads"

        self.num_heads = num_heads
        self.context_length = context_length
        self.head_dim = d_out // num_heads
        self.d_out = d_out

        self.qkv = nn.Linear(d_in, 3 * d_out, bias=qkv_bias)
        self.proj = nn.Linear(d_in, d_out)
        self.dropout = dropout

        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        batch_size, num_tokens, embed_dim = x.shape

        # (b, num_tokens, embed_dim) --> (b, num_tokens, 3 * embed_dim)
        qkv = self.qkv(x)

        # (b, num_tokens, 3 * embed_dim) --> (b, num_tokens, 3, num_heads, head_dim)
        qkv = qkv.reshape(batch_size, num_tokens, 3, self.num_heads, self.head_dim)

        # (b, num_tokens, 3, num_heads, head_dim) --> (3, b, num_heads, num_tokens, head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)

        # (3, b, num_heads, num_tokens, head_dim) -> 3 times (b, num_heads, num_tokens, head_dim)
        queries, keys, values = qkv.unbind(0)

        use_dropout = 0. if not self.training else self.dropout
        context_vec = nn.functional.scaled_dot_product_attention(
            queries, keys, values, attn_mask=None, dropout_p=use_dropout, is_causal=True)

        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, num_tokens, self.d_out)

        return context_vec
  1. I am not sure which one is better: .reshape() or .view()?
        # (b, num_tokens, 3 * embed_dim) --> (b, num_tokens, 3, num_heads, head_dim)
        qkv = qkv.reshape(batch_size, num_tokens, 3, self.num_heads, self.head_dim)
        # (b, num_tokens, 3 * embed_dim) --> (b, num_tokens, 3, num_heads, head_dim)
        qkv = qkv.view(batch_size, num_tokens, 3, self.num_heads, self.head_dim)
  2. .unbind(0) is not necessary (the shape of queries, keys, values does not change without it), is it a speed concern?
        # (3, b, num_heads, num_tokens, head_dim) -> 3 times (b, num_heads, num_tokens, head_dim)
        queries, keys, values = qkv.unbind(0)
        # (3, b, num_heads, num_tokens, head_dim) -> 3 times (b, num_heads, num_tokens, head_dim)
        queries, keys, values = qkv
  3. According to the equivalent implementation in F.scaled_dot_product_attention(), it seems like self.proj() is missing at the end:
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, num_tokens, self.d_out)

        return context_vec
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, num_tokens, self.d_out)

        context_vec = self.proj(context_vec)

        return context_vec
  4. Again, I am not sure which one is better: .reshape() or .view()?
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, num_tokens, self.d_out)

        return context_vec

ch06 - fine-tuning an LLM for binary classification task - add vs update output layer

I am following the example from ch06 for fine-tuning an LLM for a classification task. When I run the following code from the example, it doesn't update the layer but adds a new lm_head. Is this expected? Shouldn't I update the existing lm_head?

num_classes = 2

peft_model.base_model.lm_head = torch.nn.Linear(in_features=peft_model.get_input_embeddings().embedding_dim, out_features=num_classes, bias=False)

peft_model.base_model.lm_head

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.01, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.01, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (v_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=1024, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.01, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=32, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=32, out_features=1024, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (rotary_emb): MistralRotaryEmbedding()
            )
            (mlp): MistralMLP(
              (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): MistralRMSNorm()
            (post_attention_layernorm): MistralRMSNorm()
          )
        )
        (norm): MistralRMSNorm()
      )
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
    )
    (lm_head): Linear(in_features=4096, out_features=2, bias=False)
  )
)

requirements.txt

Hi,

Can you please add a requirements.txt to the repo as well (to set the environment for book in one go, without needing to install every package manually)?

In 3.3.1, there seems to be a missing image between "The attention weights and context vector calculation are summarized in the figure below:" and "The code below walks through the figure above step by step."

By convention, the unnormalized attention weights are referred to as "attention scores" whereas the normalized attention scores, which sum to 1, are referred to as "attention weights"
The attention weights and context vector calculation are summarized in the figure below:

In 3.3.1, there seems to be a missing image between "The attention weights and context vector calculation are summarized in the figure below:" and "The code below walks through the figure above step by step."

Perhaps the sentence needs to be modified

ch07/03_model-evaluation - prompt

In the notebooks of ch07/03_model-evaluation, the same prompt

prompt = (f"Given the input `{format_input(entry)}` "
              f"and correct output `{entry['output']}`, "
              f"score the model response `{entry['model 1 response']}`"
              f" on a scale from 0 to 100, where 100 is the best score. "
    )

will be used multiple times in the notebooks. What do you think about moving it into a separate function so that it needs to be set up only once (see the sketch below)? From my POV, this would make the code more modular, maintainable, and easier to read.
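A minimal sketch of such a helper (the name build_score_prompt is hypothetical; format_input() and the entry dict are assumed from the chapter 7 notebooks):

def build_score_prompt(entry, response_key="model 1 response"):
    # Builds the scoring prompt once, instead of repeating the f-string in every cell.
    return (
        f"Given the input `{format_input(entry)}` "
        f"and correct output `{entry['output']}`, "
        f"score the model response `{entry[response_key]}`"
        f" on a scale from 0 to 100, where 100 is the best score. "
    )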

I have tried this for the ollama nb (not for the openai nb):
https://github.com/d-kleine/LLMs-from-scratch/blob/main/ch07/03_model-evaluation/llm-instruction-eval-ollama.ipynb

You can see the differences here (along with code formatting and a typo fix):
https://github.com/d-kleine/LLMs-from-scratch/blob/main/ch07/03_model-evaluation/llm-instruction-eval-ollama.ipynb

What do you think about this idea?

Inconsistencies in unsqueeze operation description in the book and in notebook and its necessity (3.6.2 Implementing multi-head attention with weight splits)

Hi @rasbt,

I found that the implementation of the MultiHeadAttention class has the following line:

mask_unsqueezed = mask_bool.unsqueeze(0).unsqueeze(0)

But there is only one unsqueeze operation in the notebook:

mask_unsqueezed = mask_bool.unsqueeze(0)

But as I understand it, we can skip the unsqueeze operation altogether because the masked_fill_() method supports broadcasting.
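(A quick check supporting that, as a sketch with toy shapes:)

import torch

attn_scores = torch.randn(2, 4, 6, 6)  # (b, num_heads, num_tokens, num_tokens)
mask_bool = torch.triu(torch.ones(6, 6), diagonal=1).bool()

# The 2-D mask broadcasts over the batch and head dimensions without unsqueeze.
attn_scores.masked_fill_(mask_bool, -torch.inf)
print(attn_scores.shape)  # torch.Size([2, 4, 6, 6])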

Thank you.

Some typos in ch06.ipynb

I'm trying to fix some typos in ch06.ipynb here: #219

There is also a typo in the picture below
Item 3 should be "Reset loss gradients from previous batch".

Inconsistencies in MHA Wrapper Implementation Between Chapter 3 Main Content and Bonus Material

In the notebook ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb, the parameter d_out is not divided by num_heads. As a result, the shape differs from other implementations: [8, 1024, 9216] versus [8, 1024, 768]. Additionally, the implementation lacks the final projection.

It is correctly implemented in ch03/01_main-chapter-code/multihead-attention.ipynb, cells 6 and 7.

This inconsistency leads to a significant performance gap in the subsequent cells.
