Comments (3)
GPT models use masked attention, which ensures their predictions are based only on prior tokens:
Line 71 in 4050db6
Conceptually a GPT model would be evaluated like so:
loss = 0.0
for i in range(len(x)):
    input = x[:i+1]   # tokens up to and including position i
    target = y[i]
    output = model(input)
    loss += loss_func(output, target)
For example, its prediction for the 4th output token is based on the first four input tokens. So it's trying to predict 0 based on just 4, 7, 1, 7.
In practice, though, the architecture of GPT-like models allows that entire for loop to be computed in a single pass, using masking to ensure that each "column" of the model can only "see" previous tokens.
from mingpt.
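The single-pass masking described above can be illustrated with a minimal NumPy sketch. This is a toy single-head attention with random weights, not minGPT's actual CausalSelfAttention module: positions above the diagonal of the score matrix are set to negative infinity before the softmax, so changing a later token leaves the outputs at all earlier positions untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention with a causal mask (toy sketch)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    T = x.shape[0]
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out the future: entries above the diagonal get -inf,
    # so they receive zero weight after the softmax.
    future = np.triu(np.ones((T, T), dtype=bool), 1)
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

d = 8
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
x = rng.normal(size=(5, d))       # 5 tokens, embedding dim 8
y1 = causal_self_attention(x, Wq, Wk, Wv)

x2 = x.copy()
x2[4] = rng.normal(size=d)        # perturb only the last ("future") token
y2 = causal_self_attention(x2, Wq, Wk, Wv)
# Outputs at positions 0..3 are unaffected by the change at position 4.
```

Because every position only attends backwards, one forward pass yields a valid next-token prediction at every position simultaneously, which is what makes the conceptual loop above a single batched computation in practice.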
@fpgaminer Good catch! Thank you.
It was a miss on my part. I knew that GPT models use masked attention (since they are decoder-only models), but since I was treating this addition problem as a seq2seq formulation, the blindness was on me! So this is a non-issue.
Thanks for clarifying immediately. I would like to keep the issue open for a couple of hours, just as an illustration for others who might not have recognized this fact, and then I will close it.
Thanks
Ravi
Non-issue, since GPT uses masked attention that always hides tokens from the future.
Related Issues (20)
- how does this compare to aitextgen?
- Information leak in training procedure?
- Crashed Encoder possible data corruption
- Simplifying weight decay checking doesn't work
- About layer norm dimension parameter
- Generate images
- Question: does it support other utf-8 natural languages?
- Output of CausalSelfAttention
- How can I run a trained model and can't run Test_ Hugging face_ Import. py
- AssertionError when running generate.ipynb with default parameters
- Should the -1 marker (as a special token) be counted in vocab_size?
- What's the max output tokens this model supports?
- what is the minimum hardware requirement to train
- which pytorch version should be used for Windows OS, CPU-only inference?
- error line 200, in from_pretrained assert len(keys) == len(sd)
- concatenate two BPE tokenizers
- Support for Multi-GPU Parallel Training in chargpt.py
- how to build a model and interact with it like chatgpt?
- Strange model behavior when taking the softmax in the wrong dimension
- where did self.bias get defined in the causal attention class