Comments (8)
Thanks for the update, Andreas. The version in the requirements should work then, I guess. :)
Yes, there is. We used submitit to run all our experiments, since we have a SLURM cluster.
Our parallelization is heavily inspired by this repo: https://github.com/facebookresearch/dino
If you have a SLURM cluster: you can make the train call with executor.submit, and you can simply update the parameters of the executor to schedule a multi-GPU job:
executor.update_parameters(
gpus_per_node=8,
tasks_per_node=8, # one task per GPU
)
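For readers without the notebook at hand, the submitit setup described above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the repository's actual launch code: `multi_gpu_slurm_params`, `submit_training`, and the `slurm_logs` folder are hypothetical names, and `train_fn` stands for whatever training callable you use. Only `AutoExecutor`, `update_parameters`, and `submit` are submitit calls.

```python
def multi_gpu_slurm_params(gpus: int = 8) -> dict:
    """Single-node SLURM settings: one task (process) per GPU."""
    return {
        "nodes": 1,              # single node only in this sketch
        "gpus_per_node": gpus,
        "tasks_per_node": gpus,  # one task per GPU
    }


def submit_training(train_fn, log_dir: str = "slurm_logs", gpus: int = 8):
    """Submit train_fn as a multi-GPU SLURM job and return the submitit job handle."""
    import submitit  # imported lazily so the helper above works without submitit

    executor = submitit.AutoExecutor(folder=log_dir)  # log_dir is a hypothetical choice
    executor.update_parameters(**multi_gpu_slurm_params(gpus))
    return executor.submit(train_fn)
```

Calling `submit_training(train_fn)` on a SLURM login node would queue one job with eight tasks; the returned job object can be polled for its result.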
If not: launching with torchrun should also work out of the box, as I wrote some code to handle it, but I am not 100% sure, as we have not used this in a while.
The important code is here:
Line 238 in f02c093
The code does not support multi-node training, though.
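As a rough illustration of what torchrun handling usually involves: torchrun starts one process per GPU and passes each process its rank through environment variables. The sketch below is an assumption about the general pattern, not a copy of the repository's init_dist; `read_torchrun_env` is a hypothetical helper name.

```python
import os


def read_torchrun_env(env=None):
    """Return the rank info set by torchrun, or None when not launched by torchrun."""
    env = os.environ if env is None else env
    if "RANK" not in env or "WORLD_SIZE" not in env:
        return None  # plain `python script.py` run: treat as single-process training
    return {
        "rank": int(env["RANK"]),                     # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # GPU index on this machine
        "world_size": int(env["WORLD_SIZE"]),         # total number of processes
    }


# Hypothetical single-node launch with 8 GPUs (script name is a placeholder):
#   torchrun --nproc_per_node=8 your_training_script.py
```

With this information a script can pick the device `cuda:<local_rank>` and call `torch.distributed.init_process_group`, which by default reads the master address and port from the environment that torchrun sets up.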
from tabpfn.
Downgrading to seaborn 0.11 yields:
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
I had changed device to 'cuda'; changing it to 'cpu' makes it work.
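That TypeError is what PyTorch raises when `.numpy()` is called on a tensor that still lives on the GPU. A minimal sketch of a device-agnostic workaround, assuming the failing conversion happens in plotting code you can edit; `to_host_array` is a hypothetical helper and not part of tabpfn:

```python
def to_host_array(x):
    """Convert a torch.Tensor on any device to a NumPy array; pass other objects through."""
    if hasattr(x, "detach") and hasattr(x, "cpu"):
        # detach() drops autograd tracking, cpu() copies to host memory (no-op on CPU tensors)
        return x.detach().cpu().numpy()
    return x
```

Wrapping tensors with this helper before handing them to seaborn or matplotlib should let the same code run with device set to 'cuda' or 'cpu'.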
from tabpfn.
Another question: is there a way to do multi-GPU training using the scripts from the notebook you provide? I don't see any code to spawn workers; it looks like init_dist requires using torchrun?
from tabpfn.
Thanks for the update, Andreas. The version in the requirements should work then, I guess. :)
Oh, I thought maybe the requirements file was consumed by the setup.py, as the installation instructions only mention the pip install. It would be great to have end-to-end instructions for reproducing the training.
Thanks for the pointer to submitit, I'll check out how it works. I don't have a SLURM cluster, I have a cloud ;) I'm currently using torchrun.
from tabpfn.
Did you get this far, installing from pip? I did not expect this to work tbh and thought one needs to install from requirements to train. I will add the requirement to the setup, thanks! :)
from tabpfn.
Oh yeah I didn't touch the requirements.txt, it wasn't mentioned anywhere.
I think adding requirements.txt to setup is a bad habit, but many people do it. Having maybe one section for installing for using the model and one for reproducing the training would be great.