Comments (7)

milantracy commented on September 23, 2024

iirc @ayushr2 is working on the checkpoint with --nvproxy

from gvisor.

ayushr2 commented on September 23, 2024

> I am currently not interested in restoring GPU state

Got it. Yeah restoring GPU state is actually a very tough problem...

This checkpoint error should be an easy fix. (sent #9385)

But I think the restore is the hard part and might fail. I haven't tested it yet; I will get back to it next week. Also note that you will need to use --overlay2=none for checkpoint and restore. The default value (--overlay2=root:self) does not support restore yet. I will fix that in the coming months, but for now please use --overlay2=none.

luiscape commented on September 23, 2024

I can verify this works. Thank you for the patch!

./runsc -nvproxy -nvproxy-docker -overlay2=none restore -image-path ./checkpoint restore-test
Fri Sep 15 22:01:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                 ERR! |
| N/A   33C    P8     9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
CompletedProcess(args=['nvidia-smi'], returncode=0)

The container was checkpointed only with:

./runsc -nvproxy -overlay2=none checkpoint -image-path ./checkpoint restore-test
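
For anyone scripting the two invocations above, a minimal Python sketch might look like the following. It only assembles argument lists from the flags quoted in this thread (-nvproxy, -nvproxy-docker, -overlay2=none, -image-path). The helper name runsc_argv and its parameters are hypothetical and not part of runsc itself.

```python
import shlex
from typing import List, Optional


def runsc_argv(
    action: str,
    container_id: str,
    image_path: str,
    overlay2: Optional[str] = "none",
    nvproxy_docker: bool = False,
    runsc: str = "./runsc",
) -> List[str]:
    """Build an argv list for a runsc checkpoint or restore call.

    Hypothetical helper: it mirrors the command lines quoted in this
    thread and does not execute anything.
    """
    if action not in ("checkpoint", "restore"):
        raise ValueError(f"unsupported action: {action!r}")

    argv = [runsc, "-nvproxy"]
    if nvproxy_docker:
        # The restore command in this thread also passed -nvproxy-docker.
        argv.append("-nvproxy-docker")
    if overlay2 is not None:
        # The thread recommends -overlay2=none for checkpoint/restore.
        argv.append(f"-overlay2={overlay2}")
    argv += [action, "-image-path", image_path, container_id]
    return argv


if __name__ == "__main__":
    checkpoint = runsc_argv("checkpoint", "restore-test", "./checkpoint")
    restore = runsc_argv(
        "restore", "restore-test", "./checkpoint", nvproxy_docker=True
    )
    # Print the commands in shell form for inspection.
    print(shlex.join(checkpoint))
    print(shlex.join(restore))
```

On a host where runsc is available, each list could be passed to subprocess.run(argv, check=True). Passing overlay2=None omits the overlay flag entirely; whether that is safe depends on the runsc build in use.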

ayushr2 commented on September 23, 2024

Oh, well that's wonderful. :)

ayushr2 commented on September 23, 2024

@luiscape FYI 2d90b66 should add save/restore (S/R) support for all overlay configurations, so you don't need --overlay2=none anymore.

luiscape commented on September 23, 2024

We would be happy to explore this issue and work on a patch. We are wondering if (a) you have any plans for making this kind of checkpointing viable or (b) have any guidance on adding a SaverLoader interface to nvproxy objects.

luiscape commented on September 23, 2024

Sounds good. Thanks for letting me know.
