
Comments (10)

dptam commented on September 2, 2024

Hello,

To clarify, are you loading the model at step 67? Is the performance of the model when you load the checkpoint 53? And is the performance of the checkpoint in the log 58?

from t-few.

CaffreyR commented on September 2, 2024

Hi @dptam, the step is actually 75. As the log here shows, line 20 (epoch 19) reports an accuracy of 0.5812.
image
But when I run this code, the accuracy is 0.5848.
image

BTW, step 79 gives 0.5631.

The same thing happens on the COPA dataset: line 221 of the log shows 0.62.
image
But when I run steps 883 and 887, the result is 0.54, and step 879 gives 0.55.
image
image


dptam commented on September 2, 2024

I'm not sure what the issue is. If you don't mind, could you rerun and add self.global_step to the metrics dictionary here? This should output the global step in the log, matching the global step used to save the model, just to make sure the line number corresponds to the correct ckpt.
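A minimal sketch of that idea, in plain Python so it runs without the training code. The helper names with_global_step and find_entry are hypothetical and not part of t-few; the step and accuracy values are only illustrative, taken from the numbers quoted earlier in this thread.

```python
from typing import Dict, List, Optional, Union

Number = Union[int, float]


def with_global_step(metrics: Dict[str, Number], global_step: int) -> Dict[str, Number]:
    """Return a copy of the metrics dict with the global step recorded in it."""
    logged = dict(metrics)  # copy so the caller's dict is left untouched
    logged["global_step"] = global_step
    return logged


def find_entry(log: List[Dict[str, Number]], global_step: int) -> Optional[Dict[str, Number]]:
    """Look up the log entry written at a given global step, if any."""
    for entry in log:
        if entry.get("global_step") == global_step:
            return entry
    return None


# Hypothetical log: one entry per validation run, tagged with its step.
log = [
    with_global_step({"accuracy": 0.5812}, global_step=75),
    with_global_step({"accuracy": 0.5631}, global_step=79),
]
print(find_entry(log, 75))
```

With the step stored next to each metric, a log line can be matched to a checkpoint by step number instead of by line position.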


CaffreyR commented on September 2, 2024

Hi @dptam, actually when I run finish.pt, the result does not match the last accuracy in the log.

image

image

image


CaffreyR commented on September 2, 2024

Is there something wrong with the code? @muqeeth @jmohta @HaokunLiu


CaffreyR commented on September 2, 2024

@dptam I have added the global step as you suggested, but the results still do not match.
image
image
image


HaokunLiu commented on September 2, 2024

What is in pl_test.py? Would you mind sharing what you have there?


dptam commented on September 2, 2024

Hello,

Thanks for rerunning the code. I'm still not sure why loading and rerunning the model doesn't match the log performance - could you share the command used to train the model?

Regarding finish.pt not matching the last accuracy in the log, see #11 for more details on why.


CaffreyR commented on September 2, 2024

Hi @HaokunLiu @dptam, pl_test is actually just a copy of train, except for the loading method. I used both your model-saving method and the checkpointing method of PyTorch Lightning. See:
image
I also made a small change in encoderdecoder.py:
image

Here is the thing, though. The train command is as below:
image

And the test code is as below. pl_train and pl_test actually produce the same result:
image

And here is the log, using 51 rather than finish.pt, as @dptam suggested:
image


dptam commented on September 2, 2024

Hi,

I tried to look into it a bit and couldn't figure out the cause, but I found one issue, for me at least (not sure if it will be the same for you). Sorry, I don't have more time to look into it currently, but maybe you can.

When using t5-small and printing out the norm of self.model.lm_head.weight, the value is 94070 in the training_step function but 94072 in the predict function. This is due to some precision issues when moving from CPU to GPU, and one remedy was adding
self.weight = torch.clone(self.model.lm_head.weight).double().cuda().float() at the end of the init function for EncoderDecoder.py and adding self.model.lm_head.weight = torch.nn.Parameter(self.weight) at the beginning of the training_step function.

With this change, the norm of self.model.lm_head.weight is consistently 94070 in both the training_step and predict functions, but the accuracy from the log and the accuracy from loading a validation checkpoint still do not match. I'm not sure why, but one potential further analysis is to look at the other weights of the model.
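One way that further analysis could look, as a rough sketch rather than anything taken from the t-few code: compare every named weight between the in-training model and the reloaded checkpoint, and list the ones that differ. The function names and the toy values below are made up for illustration, and the weights are assumed to have been flattened into plain lists of floats first.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_norm(values: Sequence[float]) -> float:
    """Euclidean norm of a flat sequence of floats."""
    return math.sqrt(math.fsum(v * v for v in values))


def compare_weights(
    a: Dict[str, Sequence[float]],
    b: Dict[str, Sequence[float]],
    tol: float = 0.0,
) -> List[Tuple[str, float, float, float]]:
    """Return (name, norm_a, norm_b, max_abs_diff) for each shared weight
    whose largest element-wise difference exceeds tol."""
    mismatches = []
    for name in sorted(set(a) & set(b)):
        va, vb = a[name], b[name]
        if len(va) != len(vb):
            raise ValueError(f"size mismatch for {name}: {len(va)} vs {len(vb)}")
        max_diff = max((abs(x - y) for x, y in zip(va, vb)), default=0.0)
        if max_diff > tol:
            mismatches.append((name, l2_norm(va), l2_norm(vb), max_diff))
    return mismatches


# Toy stand-ins for the weights seen during training and after reloading.
train_weights = {"lm_head.weight": [3.0, 4.0], "encoder.weight": [1.0, 2.0]}
loaded_weights = {"lm_head.weight": [3.0, 4.0], "encoder.weight": [1.0, 2.5]}

result = compare_weights(train_weights, loaded_weights)
for name, norm_a, norm_b, max_diff in result:
    print(name, norm_a, norm_b, max_diff)
```

With PyTorch, the two inputs could be built from state dicts, for example {k: v.flatten().tolist() for k, v in model.state_dict().items()}, assuming the tensors are small enough to convert to lists. For larger models the same comparison would be better done directly on tensors.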

