Comments (5)
Got it! OK, that's everything I could want from checkpointing, haha.
I'll close the issue.
from labml.
I also opened a pull request to show what I mean. Obviously you designed this and you'll know way better than me what's going on under the hood, so I may not see every problem that comes with my approach. What do you think?
Hey, thanks for raising this and the pull request. Your approach makes perfect sense.
The decision to save to NumPy was made way back when lab was built for TensorFlow, and I kept it when we added PyTorch support and later dropped TensorFlow support (because none of the lab users seemed to be using TensorFlow).
I have personally been resetting optimizers (to keep checkpoint sizes small and because it didn't have a huge effect on the tasks I was working on), so I hadn't come across this problem yet. Saving the state_dict directly is both simpler and probably more future-proof against any changes PyTorch might make.
I'll merge your request ASAP. I just want to add some backward compatibility so that old checkpoints saved in NumPy format can still be loaded. I will close the issue once we merge it.
Thanks again
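For readers following the discussion, here is a minimal sketch of saving and restoring both the model and the optimizer through their state_dict. It is not labml's actual implementation: `save_checkpoint`, `load_checkpoint`, `make_model_and_optimizer`, and the file name `checkpoint.pt` are hypothetical names chosen for illustration. Only standard PyTorch calls (`torch.save`, `torch.load`, `state_dict`, `load_state_dict`) are used.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer):
    # Hypothetical helper: keep both state dicts in a single file so the
    # optimizer's running statistics survive a restart.
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
        path,
    )


def load_checkpoint(path, model, optimizer):
    # Hypothetical helper: restore parameters and optimizer state in place.
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])


def make_model_and_optimizer():
    # Toy model, assumed only for this example.
    model = nn.Linear(4, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    return model, optimizer


def demo():
    model, optimizer = make_model_and_optimizer()
    # Take one step so the optimizer actually has state to save.
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.pt")
        save_checkpoint(path, model, optimizer)
        restored_model, restored_optimizer = make_model_and_optimizer()
        load_checkpoint(path, restored_model, restored_optimizer)
    return model, optimizer, restored_model, restored_optimizer
```

Saving only the parameters as arrays would drop the optimizer's per-parameter state (for Adam, the moment estimates), which is why a resumed run can behave differently from an uninterrupted one.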
Great! When saving a checkpoint, it may also be necessary to save and reload the current epoch/iteration. I think it could be stored in the .json file, but I don't know what the best practice is for loading it when resuming an interrupted training run. Do you have anything in mind?
Right now it picks up the global step from the checkpoint: the checkpoint folder name is the step at which it was saved.
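As an illustration of that convention, the following sketch recovers the step to resume from by looking at folder names. It assumes a directory with one sub-folder per checkpoint, each named with the integer step; `find_latest_checkpoint` and the layout in `demo` are hypothetical and not taken from labml's code.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def find_latest_checkpoint(checkpoints_dir: Path) -> Optional[Tuple[int, Path]]:
    """Return (step, path) for the highest-numbered sub-folder, or None."""
    candidates = [
        (int(child.name), child)
        for child in checkpoints_dir.iterdir()
        # Ignore files and folders whose names are not plain integers.
        if child.is_dir() and child.name.isdecimal()
    ]
    if not candidates:
        return None
    # Compare as integers: as strings, "900" would sort after "2500".
    return max(candidates, key=lambda item: item[0])


def demo() -> Optional[Tuple[int, str]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("100", "2500", "900", "notes"):
            (root / name).mkdir()
        latest = find_latest_checkpoint(root)
        return None if latest is None else (latest[0], latest[1].name)
```

With this layout the global step needs no separate entry in a metadata file, since the folder name already carries it.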