Comments (8)

SamuelGabriel commented on September 15, 2024

Thanks for the update, Andreas. The version in the requirements should work then, I guess. :)

Yes, there is. We used submitit to run all our experiments, since we have a SLURM cluster.
Our parallelization is heavily inspired by this repo: https://github.com/facebookresearch/dino

If you have a SLURM cluster: you can make the train call with executor.submit. To schedule a multi-GPU job, simply update the executor's parameters first:

    executor.update_parameters(
        gpus_per_node=8,
        tasks_per_node=8,  # one task per GPU
    )
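To make the snippet above a bit more concrete, here is a minimal sketch of how the parameters could be built and passed to submitit. The helper names `multi_gpu_slurm_params` and `submit_training`, the `submitit_logs` folder, and the 60-minute timeout are assumptions for illustration, not code from the repo; only `AutoExecutor`, `update_parameters`, and `submit` are submitit calls.

```python
from pathlib import Path


def multi_gpu_slurm_params(gpus: int, timeout_min: int = 60) -> dict:
    """Build keyword arguments for executor.update_parameters (single node)."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "nodes": 1,              # assumption: a single-node job
        "gpus_per_node": gpus,
        "tasks_per_node": gpus,  # one task (process) per GPU
        "timeout_min": timeout_min,  # assumption: adjust to your cluster limits
    }


def submit_training(train_fn, log_dir: str = "submitit_logs", gpus: int = 8):
    """Submit train_fn as a SLURM job; needs submitit and a SLURM cluster."""
    import submitit  # imported lazily so the helper above works without it

    executor = submitit.AutoExecutor(folder=Path(log_dir))
    executor.update_parameters(**multi_gpu_slurm_params(gpus))
    return executor.submit(train_fn)  # returns a submitit Job object


# Inspect the parameters without touching the cluster.
params = multi_gpu_slurm_params(8)
print(params)
```

Keeping the parameter dict separate from the submission call makes it easy to check what will be requested before anything is queued.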

If not: launching with torchrun should also work out of the box, as I wrote some code to handle it, but I am not 100% sure, as we have not used this in a while.

The important code is here:

def init_dist(device):

The code does not support multi-node training, though.
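For the torchrun route, the launcher passes each worker its position through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below shows how an init function might read them. It is not the body of init_dist, which is not shown in this thread; `DistEnv`, `read_dist_env`, and the `train.py` script name are hypothetical.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node (usually the GPU id)
    world_size: int  # total number of processes

    @property
    def distributed(self) -> bool:
        return self.world_size > 1


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read the variables torchrun sets; fall back to a single process."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# Launched e.g. as: torchrun --nproc_per_node=8 train.py  (train.py is a placeholder)
dist_env = read_dist_env()
if dist_env.distributed:
    # A real init would typically continue with torch.cuda.set_device(local_rank)
    # and torch.distributed.init_process_group(backend="nccl") at this point.
    print(f"worker {dist_env.rank} of {dist_env.world_size}")
else:
    print("single-process run")
```

Run without torchrun, the variables are absent and the sketch falls back to a single process, so the same script works in both modes.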

SamuelGabriel commented on September 15, 2024

amueller commented on September 15, 2024

Downgrading to seaborn 0.11 yields:

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I had changed device to 'cuda'; changing it back to 'cpu' makes it work.
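An alternative to switching the device back, following the hint in the error message, would be to copy the tensor to host memory before it reaches the plotting code. This is only a sketch: `to_host_array` is a hypothetical helper, and where exactly the conversion would have to be applied in the notebook is not something this thread establishes.

```python
def to_host_array(x):
    """Convert a torch tensor (on any device) to a NumPy array.

    Objects that are not tensors are returned unchanged.
    """
    if hasattr(x, "detach") and hasattr(x, "cpu"):
        # detach() drops autograd tracking, cpu() copies the data to host
        # memory, and numpy() then succeeds because nothing is left on the GPU.
        return x.detach().cpu().numpy()
    return x
```

Calling this on model outputs just before handing them to seaborn or matplotlib would keep training on 'cuda' while plotting from CPU data.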

amueller commented on September 15, 2024

Another question: is there a way to do multi-GPU training using the scripts from the notebook you provide? I don't see any code to spawn workers; it looks like init_dist requires using torchrun?

amueller commented on September 15, 2024

> Thanks for the update, Andreas. The version in the requirements should work then, I guess. :)

Oh, I thought maybe the requirements file was consumed by the setup.py, as the installation instructions only mention the pip install. It would be great to have end-to-end instructions for reproducing the training.

Thanks for the pointer to submitit, I'll check out how it works. I don't have a SLURM cluster, I have a cloud ;) I'm currently using torchrun.

SamuelGabriel commented on September 15, 2024

Did you get this far, installing from pip? I did not expect this to work tbh and thought one needs to install from requirements to train. I will add the requirement to the setup, thanks! :)

amueller commented on September 15, 2024

Oh yeah I didn't touch the requirements.txt, it wasn't mentioned anywhere.

I think adding requirements.txt to setup is a bad habit, but many people do it. Having one section on installing to use the model and one on reproducing the training would be great.

SamuelGabriel commented on September 15, 2024
