I like deep neural nets.
karpathy / nn-zero-to-hero
Neural Networks: Zero to Hero
License: MIT License
loss = -prob[torch.arange(32), Y].log().mean()
loss
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[20], line 1
----> 1 loss = -prob[torch.arange(32), Y].log().mean()
2 loss
IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [32], [228146]
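The shapes in the message ([32] and [228146]) suggest that prob holds one minibatch of 32 rows while Y still holds the labels for the whole dataset. A minimal sketch of the likely fix, assuming the minibatch was drawn with an index tensor (called ix here, a hypothetical name) and using random stand-ins for the real data:

```python
import torch

torch.manual_seed(0)
num_examples, vocab_size, batch_size = 228146, 27, 32

# Stand-ins for the real data: labels for the full dataset and one minibatch of logits.
Y = torch.randint(0, vocab_size, (num_examples,))
ix = torch.randint(0, num_examples, (batch_size,))  # minibatch indices (hypothetical name)
logits = torch.randn(batch_size, vocab_size)
prob = torch.softmax(logits, dim=1)

# Select the labels with the same minibatch indices so both index tensors have 32 entries.
Yb = Y[ix]
loss = -prob[torch.arange(batch_size), Yb].log().mean()
print(loss)
```

Using prob.shape[0] instead of a hard-coded 32 in torch.arange keeps the two index tensors in step if the batch size changes.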
Hello Andrej, in your notebook for makemore part 3, you built BatchNorm1d as:
class BatchNorm1d:

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # buffers (trained with a running 'momentum update')
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        # calculate the forward pass
        if self.training:
            xmean = x.mean(0, keepdim=True) # batch mean
            xvar = x.var(0, keepdim=True) # batch variance
        else:
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        # update the buffers
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
I think there may be something wrong in xvar = self.running_var. If the input has batch size 1, the var function uses the unbiased estimator, and since 1 - 1 = 0, this results in a division by zero.
In fact, when I ran your sampling code:
# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)
for _ in range(20):
    out = []
    context = [0] * block_size # initialize with all ...
    while True:
        # forward pass the neural net
        emb = C[torch.tensor([context])] # (1,block_size,n_embd)
        x = emb.view(emb.shape[0], -1) # concatenate the vectors
        for layer in layers:
            x = layer(x)
        logits = x
        probs = F.softmax(logits, dim=1)
        # sample from the distribution
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        # shift the context window and track the samples
        context = context[1:] + [ix]
        out.append(ix)
        # if we sample the special '.' token, break
        if ix == 0:
            break
    print(''.join(itos[i] for i in out)) # decode and print the generated word
I got an error at the BatchNorm1d layer. Since the input x has batch_size=1, the calculated variance is all nan.
Btw, when I used PyTorch's implementation
layers = [
    Linear(n_embd * block_size, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
]
instead of
layers = [
    Linear(n_embd * block_size, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size, bias=False), BatchNorm1d(vocab_size),
]
everything was fine.
I know your notebook has been run and tested, so I think there is something I missed in the video or notebook?
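A possible explanation, offered as a guess rather than a confirmed diagnosis: if the layers are still in training mode during sampling, the batch statistics are computed from a single example, and the unbiased variance of one sample is undefined. The sketch below reproduces the nan and shows the evaluation-mode path that uses the running buffers instead; the tensor sizes are illustrative only.

```python
import torch

x = torch.randn(1, 4)  # a "batch" holding a single example

# Training-mode statistics: the unbiased variance divides by (batch_size - 1) = 0.
batch_var = x.var(0, keepdim=True)
print(batch_var)  # every entry is nan (PyTorch may also emit a warning)

# Evaluation-mode statistics: normalize with the running buffers instead of the batch.
running_mean = torch.zeros(4)
running_var = torch.ones(4)
eps = 1e-5
xhat = (x - running_mean) / torch.sqrt(running_var + eps)
print(xhat)
```

With the BatchNorm1d class quoted above, this corresponds to setting layer.training = False on every layer before running the sampling loop.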
I ran into an interesting issue in makemore part 4 (backprop ninja) where dhpreact did not exactly match hpreact.grad.
However, this only happened in the Colab notebook; when I put the same code into a local Jupyter notebook, it worked fine.
I'm not sure why this would be the case; it's just an odd curiosity.
If you find it interesting, I would like to propose Time Series Forecasting and the difficulties associated with Transfer Learning as a new topic for the upcoming lecture. Thank you for your content, I really appreciate it!
Hi Andrej, hi everyone,
First of all, let me add my voice to the chorus: such awesome lectures, I'm very grateful for them, and I recommend them to people around me whenever I have the opportunity!
At one point in the backprop lecture, you mention that there might be a slicker way to update the last gradient tensor, dC, instead of the Python loop you used. This tickled my curiosity, so I tinkered, and here's the solution I came up with; maybe others have found even better ways! (Although, arguably, if you're not into Torch nerdiness, the threat to time management and peace of mind when basking in advanced indexing might not be a great trade-off against the slow but straightforward loop! : >)
So, instead of:
dC = torch.zeros_like(C)
for k in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        ix = Xb[k,j]
        dC[ix] += demb[k,j]
It is possible to do:
# arange -> unsqueeze -> tile -> flatten
# [0,1,...,31] -> [[0],   -> [[0,0,0],    -> [0,0,0,1,1,1,...,31,31,31]  # batch_size * block_size entries
#                  [1],       [1,1,1],
#                  ...        ...
#                  [31]]      [31,31,31]]
rows_xi = torch.tile(torch.arange(0, Xb.shape[0]).unsqueeze(1), (1, Xb.shape[1])).flatten()
# [0,1,2] -> [0,1,2,0,1,2,...,0,1,2]  # batch_size * block_size entries
cols_xi = torch.tile(torch.arange(0, Xb.shape[1]), (Xb.shape[0],))
emb_xi = Xb[rows_xi, cols_xi]  # batch_size * block_size row indices into C
dC1 = torch.zeros_like(C)
dC1.index_put_((emb_xi,), demb[rows_xi, cols_xi], accumulate=True)
A torch.allclose(dC1, dC) yields True on my end.
I'm indebted to the all-answering @ptrblck for that .index_put_(..., accumulate=True) reference!
Have a great day!
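For comparison, here is a sketch of another way to express the same accumulation, using Tensor.index_add_ on flattened indices. This is an alternative to the index_put_ approach above, not something taken from the lecture; the tensors are random stand-ins with illustrative sizes, and the result is checked against the explicit loop.

```python
import torch

torch.manual_seed(0)
vocab_size, n_embd, batch_size, block_size = 27, 10, 32, 3

# Random stand-ins for the lecture's tensors.
C = torch.randn(vocab_size, n_embd)
Xb = torch.randint(0, vocab_size, (batch_size, block_size))
demb = torch.randn(batch_size, block_size, n_embd)

# Reference: the explicit double loop.
dC_loop = torch.zeros_like(C)
for k in range(Xb.shape[0]):
    for j in range(Xb.shape[1]):
        dC_loop[Xb[k, j]] += demb[k, j]

# Vectorized: flatten indices and gradients, then add each gradient row
# into the row of dC selected by the matching index.
dC_vec = torch.zeros_like(C)
dC_vec.index_add_(0, Xb.reshape(-1), demb.reshape(-1, n_embd))

print(torch.allclose(dC_vec, dC_loop, atol=1e-6))
```

Like index_put_ with accumulate=True, index_add_ sums contributions from repeated indices, which is what the loop does when the same character appears several times in a batch.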
names.txt is missing
I'm doing some research into the method of finding an optimal learning rate.
I built the models both from scratch, as in the videos, and in a torch-friendly way, i.e. using torch modules, dataloaders, optimizers, etc.
However, something weird happens when running the following, which 'should' be the same as the code from the video. The lr-loss graph is shown in im1 below.
im2 uses code very similar to the videos, i.e. manually updating the weights. Why are the results not the same? Is the optimizer doing something different in the backend? Overall the training is about the same; both converge at roughly the same rate.
def findlr(model, data, test_dataloader):
    lrs = torch.linspace(0.01, 1, 1000)
    lrs = 10**torch.linspace(-3, 0, 1000)
    lri = []
    lss = []
    optim = torch.optim.SGD(model.parameters(), lr=lrs[0])
    for i in range(len(lrs)):
        for g in optim.param_groups:
            g['lr'] = lrs[i]
        x, y = next(iter(data))
        l = calcLoss(model(x), y)
        model.zero_grad()
        l.backward()
        optim.step()
        lri.append(lrs[i].item())
        lss.append(l.item())
        print(lrs[i], l.item())
    plt.plot(lri, lss)
    plt.show()
im1:
<img width="597" alt="image" src="https://user-images.githubusercontent.com/95486801/193437520-55d14507-867c-411f-9c9e-e11db3b9e67c.png">
im2:
<img width="597" alt="image" src="https://user-images.githubusercontent.com/95486801/193437537-481b38c6-447a-4bf7-86ee-bffb291b737a.png">
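One way to narrow this down is to check whether the optimizer itself behaves differently. The sketch below, which uses a toy linear model and random data rather than the poster's model, compares a manual update with torch.optim.SGD (no momentum, no weight decay) on identical batches. If the two agree, the difference between im1 and im2 more likely comes from elsewhere, for example from which batches are drawn: next(iter(data)) builds a fresh iterator on every call, so it returns the first batch of a new pass each time.

```python
import torch

torch.manual_seed(0)
x = torch.randn(16, 3)  # toy inputs (illustrative)
y = torch.randn(16, 1)  # toy targets (illustrative)
lr = 0.1

# Two copies of the same weights: one updated by hand, one by torch.optim.SGD.
w_manual = torch.randn(3, 1, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)
optim = torch.optim.SGD([w_optim], lr=lr)  # plain SGD: no momentum, no weight decay

for _ in range(5):
    # Manual update, in the style of the videos.
    loss_manual = ((x @ w_manual - y) ** 2).mean()
    w_manual.grad = None
    loss_manual.backward()
    w_manual.data += -lr * w_manual.grad

    # Optimizer update on the same batch.
    loss_optim = ((x @ w_optim - y) ** 2).mean()
    optim.zero_grad()
    loss_optim.backward()
    optim.step()

print(torch.allclose(w_manual, w_optim, atol=1e-6))
```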
Hi Andrej, this is really great and thank you so very much for the material. It is truly super useful!
I wonder, what would you use before this series, to introduce neural nets, what they do / can do, the foundation before diving into back-prop?
Have you seen a nice lecture that does that in a simple way, just aligned with your course and material?
Thank you karpathy for open sourcing this great course series.
I think the discussion board needs to be opened.
I found that in the process of learning, there were many thoughts and questions rather than issues. I think these thoughts should be enlightening to others, but because they are not issues, I cannot find a suitable place to post these thoughts.
I am getting the error below while running the backward pass:
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
Code
#Forward Pass
logits = (xenc @ W)
counts = logits.exp()
prob = counts/counts.sum(1,keepdim=True)
loss = - prob[torch.arange(5),ys].log().mean()
print(loss.item())
#Backward Pass
W.grad=None
loss.backward()
#update the weights
W.data += -0.1 * W.grad
Query:
Why are we performing the one-hot encoding of the input every time while iterating the forward pass?
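One common cause of this error, stated here as a guess about this case, is calling loss.backward() a second time without re-running the forward pass, for example by re-executing only the backward cell. The sketch below keeps the whole forward pass inside the loop so that every backward() call gets a fresh graph. It also touches on the query: the one-hot encoding does not depend on W, so in this sketch it is built once outside the loop. The index values in xs and ys are made-up examples.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
xs = torch.tensor([0, 5, 13, 13, 1])  # example input indices (made up)
ys = torch.tensor([5, 13, 13, 1, 0])  # example target indices (made up)
W = torch.randn((27, 27), generator=g, requires_grad=True)

# The one-hot encoding does not depend on W, so it can be built once.
xenc = F.one_hot(xs, num_classes=27).float()

losses = []
for _ in range(10):
    # Forward pass: rebuilt on every iteration so backward() sees a fresh graph.
    logits = xenc @ W
    counts = logits.exp()
    prob = counts / counts.sum(1, keepdim=True)
    loss = -prob[torch.arange(5), ys].log().mean()
    losses.append(loss.item())

    # Backward pass
    W.grad = None
    loss.backward()

    # Update the weights
    W.data += -0.1 * W.grad

print(losses[0], losses[-1])
```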
Hello Andrej. First of all, I would like to express my gratitude to you for sharing such valuable videos with us for free.
While watching the 'makemore part4' video, I was also trying to apply it to a dataset I created myself. When I took the chained derivative in the 'dhpreact' part, the comparison started to report mismatches, and since the derivatives are chained, the mismatches carried over to the subsequent outputs. Below, I share the code and the output.
Please share any other solution if you have one. Using different Torch versions and changing the dtype to 'double', as suggested in the comments, didn't work for me.
dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1.0/n
dprobs = (1.0 / probs) * dlogprobs
dcounts_sum_inv = (counts * dprobs).sum(1, keepdim=True)
dcounts = counts_sum_inv * dprobs
dcounts_sum = (-counts_sum**-2) * dcounts_sum_inv
dcounts += torch.ones_like(counts) * dcounts_sum
dnorm_logits = counts * dcounts
dlogits = dnorm_logits.clone()
dlogit_maxes = (-dnorm_logits).sum(1, keepdim=True)
dlogits += F.one_hot(logits.max(1).indices, num_classes=logits.shape[1]) * dlogit_maxes
dh = dlogits @ W2.T
dW2 = h.T @ dlogits
db2 = dlogits.sum(0)
dhpreact = (1.0 - h**2) * dh
dbngain = (bnraw * dhpreact).sum(0, keepdim=True)
dbnraw = bngain * dhpreact
Output :
logprobs | exact: True | approximate: True | maxdiff: 0.0
probs | exact: True | approximate: True | maxdiff: 0.0
counts_sum_inv | exact: True | approximate: True | maxdiff: 0.0
counts_sum | exact: True | approximate: True | maxdiff: 0.0
counts | exact: True | approximate: True | maxdiff: 0.0
norm_logits | exact: True | approximate: True | maxdiff: 0.0
logit_maxes | exact: True | approximate: True | maxdiff: 0.0
logits | exact: True | approximate: True | maxdiff: 0.0
h | exact: True | approximate: True | maxdiff: 0.0
W2 | exact: True | approximate: True | maxdiff: 0.0
b2 | exact: True | approximate: True | maxdiff: 0.0
hpreact | exact: False | approximate: True | maxdiff: 4.656612873077393e-10
bngain | exact: False | approximate: True | maxdiff: 1.862645149230957e-09
bnbias | exact: False | approximate: True | maxdiff: 7.450580596923828e-09
bnraw | exact: False | approximate: True | maxdiff: 6.984919309616089e-10
bnvar_inv | exact: False | approximate: True | maxdiff: 3.725290298461914e-09
bnvar | exact: False | approximate: True | maxdiff: 9.313225746154785e-10
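In the output above, every mismatch has a maxdiff around 1e-9 and every approximate check passes, which is consistent with floating-point rounding rather than an incorrect derivative. Two expressions that are mathematically equal can differ in the last bits when the operations are carried out in a different order. A small illustration, using float64 values and the same three checks that the output reports:

```python
import torch

# Floating-point addition is not associative: regrouping changes the last bits.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False

ta = torch.tensor([a], dtype=torch.float64)
tb = torch.tensor([b], dtype=torch.float64)

exact = torch.all(ta == tb).item()      # bitwise-equal values?
approximate = torch.allclose(ta, tb)    # equal within tolerance?
maxdiff = (ta - tb).abs().max().item()  # largest absolute difference
print(f"exact: {exact} | approximate: {approximate} | maxdiff: {maxdiff}")
```

Under this reading, the approximate column is the meaningful check, and small differences between environments (for example Colab and a local machine) would not be surprising.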
When I tried to re-implement the code for the count matrix myself, I noticed a slight difference between my nll loss and the one in the notebook.
Here is what we have in the video: -nll = 559891.75
but here is what I got: -nll = 559873.5915061831
The difference is about 18.
After reviewing both versions, I found that the issue is with .item(), which converts the tensor to a Python number.
In the notebook, the value is accumulated as logprob:
log_likelihood += logprob
But I calculated this:
nll += logprob.item()
Here is an example:
print(torch.log(prob).item())
print(torch.log(prob))
>>> -1.4469189643859863
>>> tensor(-1.4469)
The PyTorch documentation for item() does not mention anything about this.
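A likely explanation, offered with some caution: the four-decimal tensor(-1.4469) is only how the tensor is printed, and the stored value is unchanged. The gap more plausibly comes from the precision of the running sum. Adding tensors keeps the sum in float32, while adding .item() values accumulates in Python floats, which are float64. The sketch below repeats one log-probability many times to make the effect visible; the value and the iteration count are arbitrary choices for illustration.

```python
import torch

logprob = torch.tensor(0.1).log()  # a float32 scalar, like the values in the loop
n_steps = 50_000                   # arbitrary, just large enough to show the drift

total_tensor = torch.tensor(0.0)   # running sum stays in float32
total_python = 0.0                 # Python floats are float64
for _ in range(n_steps):
    total_tensor += logprob
    total_python += logprob.item()

# Reference value computed with a single float64 multiplication.
reference = logprob.item() * n_steps
err_tensor = abs(total_tensor.item() - reference)
err_python = abs(total_python - reference)
print(err_tensor, err_python)
```

In this sketch the float32 running sum drifts further from the reference than the float64 one, which matches the direction of the observation above: the .item() version is probably the more accurate of the two.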
First, thanks a lot Andrej for the great series.
Here is the factorized version for dC:
one_hot_Xb = F.one_hot(Xb,27).view((Xb.shape[0]*Xb.shape[1],27)).float()
dC = one_hot_Xb.T @ demb.view((demb.shape[0]*demb.shape[1],demb.shape[2]))
I came up with it mainly through shape comparison; could you please look into it? :-)
@karpathy
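The one-hot product can be checked against the explicit loop. In the sketch below the tensors are random stand-ins with illustrative sizes; row v of the transposed one-hot matrix selects every (example, position) pair whose index equals v, so the matrix product sums the matching rows of demb.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, n_embd, batch_size, block_size = 27, 10, 32, 3
Xb = torch.randint(0, vocab_size, (batch_size, block_size))
demb = torch.randn(batch_size, block_size, n_embd)

# One row per (example, position) pair: shape (batch_size * block_size, vocab_size).
one_hot_Xb = F.one_hot(Xb, vocab_size).view(-1, vocab_size).float()
# Row v of the product sums every demb row whose index equals v.
dC = one_hot_Xb.T @ demb.view(-1, n_embd)

# Reference: the explicit double loop.
dC_loop = torch.zeros(vocab_size, n_embd)
for k in range(batch_size):
    for j in range(block_size):
        dC_loop[Xb[k, j]] += demb[k, j]

print(torch.allclose(dC, dC_loop, atol=1e-5))
```

The trade-off is memory: the one-hot matrix has batch_size * block_size * vocab_size entries, which is fine at this scale but grows with the vocabulary.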
I get an error from the line B, T, C = x.shape if I use more than 3 FlattenConsecutive layers:
ValueError: not enough values to unpack (expected 3, got 2)
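A shape walk-through may explain the error. The sketch below is a simplified re-implementation, not the lecture's exact class. It assumes a block size of 8, groups of n = 2, and a layer that drops the time dimension once it shrinks to 1; the Linear layers that sit between the flatten layers are omitted because they do not change the time dimension.

```python
import torch

class FlattenConsecutive:
    """Sketch of a layer that merges n consecutive time steps into the channel dimension."""

    def __init__(self, n):
        self.n = n

    def __call__(self, x):
        B, T, C = x.shape  # requires a 3D input
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:
            x = x.squeeze(1)  # assumption: the time dimension is dropped once it reaches 1
        return x

x = torch.randn(4, 8, 10)  # (batch, block_size=8, n_embd=10), illustrative sizes
shapes = []
for _ in range(3):
    x = FlattenConsecutive(2)(x)
    shapes.append(tuple(x.shape))
print(shapes)  # [(4, 4, 20), (4, 2, 40), (4, 80)]

try:
    FlattenConsecutive(2)(x)  # a fourth layer receives a 2D tensor
except ValueError as e:
    error = str(e)
    print(error)
```

Under these assumptions, three layers already reduce 8 time steps to 1, so a fourth layer only works with a longer context (for example a block size of 16 with n = 2).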
In the notebook micrograd_lecture_first_half_roughly.ipynb, the call to L.backward() is missing before the graph is drawn with draw_dot(L).
Because of this, the gradients do not show up in the graph.
When I run the makemore_part1_bigrams notebook, in the cell where we draw a single sample based on the probability distribution in the first row of N, I get a different sample ('c') compared to the one in the video ('m'). Everything else until then seems to be the same, and with a manually seeded generator I'd expect even this to match exactly.
Value in this repo (and the video):
I have pushed the full run until this cell in the commit here.
What am I missing?
I was looking for it and found it here
https://github.com/karpathy/makemore/blob/master/names.txt
under the makemore repo, but it is referenced from the nn-zero-to-hero repo, so I would expect to see it here as well.
n +=1 should be in the outer loop, I guess.
You want to count the words, not the characters.
A forum would be very nice. YouTube comments could do it, but they are full of nontechnical comments. Could we use the PyTorch forums, maybe?
The YouTube video links for lectures 3, 4 and 5 are broken. I'm not sure if it's just me facing the issue.
Hi,
First of all, thanks for the great video tutorial! You describe in the video at 2:11:48 that you need to reset the grad values to zero between iterations by assigning zero to the grad value of every parameter.
I may be misunderstanding this part, but it seems to me that one would also need to zero the grad values for all nodes and not just for the ones representing the parameters, otherwise the grad values for the internal nodes between the parameters will still keep accumulating between iterations.
I may easily be mistaken, so I'm phrasing this issue as a question rather than as a bug report. Is it enough to zero_grad the parameters and if so, why?
Thanks!
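One way to look at this question: intermediate nodes are created anew on every forward pass, so they start with grad = 0.0 each time, whereas parameters are the same objects across iterations and therefore keep accumulating. The sketch below uses a heavily simplified stand-in for micrograd's Value (multiplication only) to show that zeroing just the parameter gives the same gradient on every iteration.

```python
class Value:
    """Tiny scalar autograd node, a simplified stand-in for micrograd's Value."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topological order, so each node's grad is complete before it is propagated.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

w = Value(3.0)  # a parameter: the same object is reused in every iteration
x = Value(2.0)  # an input

grads = []
for _ in range(3):
    hidden = w * x          # new intermediate node, created with grad = 0.0
    loss = hidden * hidden  # new output node
    w.grad = 0.0            # zero only the parameter
    loss.backward()
    grads.append(w.grad)
print(grads)  # [24.0, 24.0, 24.0]
```

Note that x.grad does keep growing here, because x is also reused across iterations; that is harmless as long as only the parameters are updated from their gradients.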
When I run the following command from micrograd_lecture_second_half_roughly.ipynb:
[(yout - ygt)**2 for ygt, yout in zip(ys, ypred)]
I get a clean output:
[Value(data=1.8688676392069992)), Value(data=0.37661915959598025)), Value(data=0.13611849326555403)), Value(data=1.69235533142263))]
But whenever I try to enclose it in a sum() function:
sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
It throws an error:
TypeError: unsupported operand type(s) for +: 'int' and 'Value'
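The built-in sum() starts from the int 0, so the first operation it performs is 0 + Value. Python first asks int to handle the addition, and when that fails it looks for a reflected method, __radd__, on the right-hand operand. The error suggests that the Value class in use does not define one. The sketch below uses a stripped-down stand-in for Value (data only, no gradients) to show two ways around it.

```python
class Value:
    """Stripped-down stand-in for micrograd's Value: data only, no gradients."""

    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # Wrap plain numbers so that Value + 1 also works.
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)

    def __radd__(self, other):
        # Called for `other + self` when `other` (e.g. the int 0) cannot handle the addition.
        return self + other

    def __repr__(self):
        return f"Value(data={self.data})"

terms = [Value(1.5), Value(0.25), Value(2.0)]
total = sum(terms)                  # works: 0 + Value falls back to Value.__radd__
total_alt = sum(terms, Value(0.0))  # alternative: start from a Value, no __radd__ needed
print(total, total_alt)  # Value(data=3.75) Value(data=3.75)
```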