Comments (7)
iirc @ayushr2 is working on the checkpoint with --nvproxy
from gvisor.
I am currently not interested in restoring GPU state
Got it. Yeah restoring GPU state is actually a very tough problem...
This checkpoint error should be an easy fix. (sent #9385)
But I think the restore is the hard part and might fail. Didn't test it yet, will get back to it next week. Also note that you will need to use --overlay2=none for checkpoint/restore. The default value (--overlay2=root:self) does not support restore yet. I will fix that in the coming months, but for now please use --overlay2=none.
from gvisor.
I can verify this works. Thank you for the patch!
./runsc -nvproxy -nvproxy-docker -overlay2=none restore -image-path ./checkpoint restore-test
Fri Sep 15 22:01:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                 ERR! |
| N/A   33C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
CompletedProcess(args=['nvidia-smi'], returncode=0)
The container was checkpointed only with:
./runsc -nvproxy -overlay2=none checkpoint -image-path ./checkpoint restore-test
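For anyone scripting this, the two invocations in this comment can be assembled programmatically. Below is a minimal sketch, not part of runsc or gVisor: the helper name `runsc_args` and its parameters are hypothetical, and the only flags it emits are the ones quoted in this thread (`-nvproxy`, `-nvproxy-docker`, `-overlay2=...`, `-image-path`). It only builds the argument list; it does not run anything.

```python
import shlex
from typing import List, Optional


def runsc_args(action: str, image_path: str, container_id: str,
               overlay2: Optional[str] = "none",
               nvproxy_docker: bool = False,
               runsc: str = "./runsc") -> List[str]:
    """Build the argv for a runsc checkpoint or restore with nvproxy enabled.

    Hypothetical helper: it mirrors the command lines quoted in this thread
    and makes no claim about other runsc flags or their defaults.
    """
    if action not in ("checkpoint", "restore"):
        raise ValueError(f"unsupported action: {action!r}")
    args = [runsc, "-nvproxy"]
    if nvproxy_docker:
        args.append("-nvproxy-docker")
    if overlay2 is not None:
        # Pass None to omit the flag and fall back to runsc's default overlay.
        args.append(f"-overlay2={overlay2}")
    args += [action, "-image-path", image_path, container_id]
    return args


# Print the two command lines used in this comment.
print(shlex.join(runsc_args("checkpoint", "./checkpoint", "restore-test")))
print(shlex.join(runsc_args("restore", "./checkpoint", "restore-test",
                            nvproxy_docker=True)))
```

The returned list can be handed to `subprocess.run(args, check=True)` on a host where `./runsc` exists. Passing `overlay2=None` drops the overlay flag for builds where `--overlay2=none` is not required.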
from gvisor.
Oh, well that's wonderful. :)
from gvisor.
@luiscape FYI 2d90b66 should add S/R support for all overlay configurations, so you don't need --overlay2=none anymore.
from gvisor.
We would be happy to explore this issue and work on a patch. We are wondering if (a) you have any plans for making this kind of checkpointing viable or (b) have any guidance on adding a SaverLoader interface to nvproxy objects.
from gvisor.
Sounds good. Thanks for letting me know.
from gvisor.