Comments (10)
Thanks @b-chu and @mvpatel2000 for getting back to me, and thanks for pushing a patch upstream!
Pretending to be on rank 1 by changing only `RANK` was enough for us to test behaviors that don't require synchronization across ranks (for example, a subclass of `LoggerDestination`). I tried to mock all the other variables related to `torch.distributed`, but it didn't work because `MASTER_ADDR` and `MASTER_PORT` point to a service that does not exist. As @b-chu suggested, communication would have to be mocked too, but that sounds non-trivial.
Looking at your `Makefile`, the distributed tests are launched with `python -m composer.cli.launcher -n $(WORLD_SIZE) -m pytest`. I think the distributed magic happens inside `composer.cli.launcher`, while pytest markers skip some of the tests when you are not in a distributed setup. In our case, our CI is managed by an internal tool which calls `pytest` directly.
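To illustrate that marker pattern (a hedged sketch — the marker name and skip condition are hypothetical, not Composer's actual test helpers), a test can be skipped whenever the launcher-provided `WORLD_SIZE` variable indicates a single-rank run:

```python
import os

import pytest

# Hypothetical marker: the launcher exports WORLD_SIZE, so a plain
# `pytest` invocation without a distributed launch sees WORLD_SIZE
# absent (default 1) and the test is skipped instead of crashing on
# missing distributed state.
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", "1"))

requires_dist = pytest.mark.skipif(
    WORLD_SIZE < 2,
    reason="requires a distributed launch, e.g. `composer -n 2 -m pytest`",
)

@requires_dist
def test_sync_across_ranks():
    ...  # the body would exercise cross-rank collectives
```

A CI tool calling `pytest` directly then simply runs the non-distributed subset.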
from composer.
@antoinebrl we are going to revert this change for the next release and work with the torch team on a better solution here. Instead, we will delay switching APIs until torch 2.3. Thanks for raising this with us!
@mvpatel2000 sorry, I can't add the `bug` tag because this issue was automatically created by GitHub from a comment I made on a merged PR.
- Do your tests work with Composer 0.17.2?
- Are you using the same PyTorch version with Composer 0.17.2 and 0.18.0? Which torch version is it?
For the first error, can you post the stack trace? How are you mocking environment variables? We set defaults, so I wonder what gets logged here:
`composer/composer/utils/dist.py`, line 511 in `35015b7`
Hi @b-chu,
Yes, we were using Composer `0.17.2` so far. We have an automation bot to bump our dependencies, so our CI caught the problem quickly. Our approach worked with any PyTorch version `>=2.0`; we are currently using the latest, `2.1.2`.
The environment variables are mocked using the `mocker` fixture in `pytest`. When we want to pretend we are on a different node, we change `RANK` like so:

```python
mocker.patch.dict("os.environ", {"RANK": "1"})
```

This is the stack trace once the `Trainer` is instantiated:
```
.venv/lib/python3.10/site-packages/composer/trainer/trainer.py:989: in __init__
    dist.initialize_dist(device, dist_timeout)
.venv/lib/python3.10/site-packages/composer/utils/dist.py:527: in initialize_dist
    dist.init_process_group(device_obj.dist_backend, timeout=timeout_timedelta)
.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:74: in wrapper
    func_return = func(*args, **kwargs)
.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1141: in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
.venv/lib/python3.10/site-packages/torch/distributed/rendezvous.py:236: in _env_rendezvous_handler
    world_size = int(_get_env_or_raise("WORLD_SIZE"))
.venv/lib/python3.10/site-packages/torch/distributed/rendezvous.py:216: ValueError
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
```
with this debugging statement from `initialize_dist`:

```
DEBUG composer.utils.dist:dist.py:511 Initializing torch.dist: global_rank=1, local_rank=0, world_size=1, local_world_size=1, node_rank=0
```
Looking at the changelog, `dist.initialize_dist()` was not called in previous versions because we were in the case `dist.get_world_size() == 1`:

```diff
-        if deepspeed_config is not None or fsdp_config is not None or dist.get_world_size() > 1:
-            # Deepspeed and FSDP both require torch.distributed to be initialized, even if the world size is 1
-            # And torch.distributed is always required for multi-rank training
-            dist.initialize_dist(device, dist_timeout)
+        dist.initialize_dist(device, dist_timeout)
```
The condition was removed to support `DTensor`, but this prevents us from using the `Trainer` without launching via the `composer` command.
It seems like the previous setup was incorrect, but that error went uncaught. By patching only the rank, torch gets confused because you're telling it you're rank 1 (i.e., the second GPU) while at the same time saying you only have one GPU (since the world size is 1).
In order to test a distributed setup without actually having a distributed setup, you would need to mock all the communication and env vars.
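As a minimal sketch of the env-var half of that (values are illustrative, and real cross-rank communication would still need to be mocked on top), the patched variables at least have to be mutually consistent, with `RANK` strictly below `WORLD_SIZE`:

```python
import os
from unittest import mock

# Illustrative values for a consistent single-process "distributed" env:
# a rank must satisfy 0 <= RANK < WORLD_SIZE, so with a world size of 1
# the only valid rank is 0 (RANK=1 is what triggered the confusion above).
fake_dist_env = {
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
    "LOCAL_WORLD_SIZE": "1",
    "NODE_RANK": "0",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
}

with mock.patch.dict(os.environ, fake_dist_env):
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # The invariant the original mock violated:
    assert 0 <= rank < world_size
```

`mock.patch.dict` restores the original environment on exit, so the fake values never leak into other tests.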
Removing `bug` since this isn't an issue with Composer.
Apologies for the trouble here. As you pointed out, Composer now always calls `dist.initialize_dist`. I agree this is a bit annoying. This is because torch has a new unified API for getting/setting the model/optimizer `state_dict`, which for some reason requires dist to be initialized even for a single process. We are looking at upstreaming a fix to torch that removes this requirement. The new API comes out in torch 2.2.
Technically, this should still be valid in all scenarios, but it does cause problems if environment variables are not correctly configured. IIRC there are 3 options:
- In this case, as @b-chu pointed out, I believe your rank var should be `0`
- Or, you can run `composer -n 1` and it will set the env vars for you
- Or, you can set no env vars and Composer will default-initialize for just 1 rank.
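The first and third options can be sketched with plain `unittest.mock` (the `pytest-mock` `mocker` fixture wraps the same API; the variable names are the standard torch `env://` rendezvous ones):

```python
import os
from unittest import mock

# Option 1: keep the mock, but use rank 0, which is consistent with the
# default world size of 1.
with mock.patch.dict(os.environ, {"RANK": "0"}):
    assert os.environ["RANK"] == "0"

# Option 3: remove the torch.distributed env vars entirely so that a
# single rank gets default-initialized.
dist_vars = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
clean_env = {k: v for k, v in os.environ.items() if k not in dist_vars}
with mock.patch.dict(os.environ, clean_env, clear=True):
    assert all(v not in os.environ for v in dist_vars)
```

Both patches are scoped to the `with` block, so each test sees exactly the environment it asked for.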
Please feel free to reach out if none of these work. For what it's worth, we use the same setup as you for unit testing in Composer and it does work fine with the above.
I will leave this issue open until you can confirm it works!
from composer.
FYI, we have yanked this release for unrelated checkpointing reasons, but I'll see if we can avoid this in 0.18.1.