
Comments (10)

antoinebrl commented on June 10, 2024

Thanks @b-chu and @mvpatel2000 for coming back to me and thanks for pushing a patch upstream!

Pretending to be on rank 1 by changing only RANK was enough for us to test behaviors that don't require synchronization across ranks (for example, a subclass of LoggerDestination). I tried to mock all the other torch.distributed-related variables, but it didn't work because MASTER_ADDR and MASTER_PORT point to a service that does not exist. As @b-chu suggested, communication would have to be mocked too, but that sounds non-trivial.

Looking at your Makefile, the distributed tests are launched using python -m composer.cli.launcher -n $(WORLD_SIZE) -m pytest. I think the distributed magic happens inside composer.cli.launcher, while you leverage pytest markers to skip some of the tests when you are not in a distributed setup. In our case, our CI is managed by an internal tool which calls pytest directly.
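
For reference, a minimal sketch of that kind of marker-based skip, using a generic skipif rather than Composer's actual markers or fixtures:

import os

import pytest

# Skip unless the test was started under a multi-rank launcher that exported WORLD_SIZE.
_world_size = int(os.environ.get("WORLD_SIZE", "1"))

@pytest.mark.skipif(_world_size < 2, reason="requires a distributed launch (WORLD_SIZE >= 2)")
def test_something_distributed():
    ...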

mvpatel2000 commented on June 10, 2024

@antoinebrl we are going to revert this change for the next release and work with the torch team on a better solution here. Instead, we will delay switching APIs until torch 2.3. Thanks for raising this to us!

#2899

antoinebrl commented on June 10, 2024

@mvpatel2000 sorry, I can't add the bug tag because this issue was automatically created by GitHub from a comment I made on a merged PR.

b-chu commented on June 10, 2024
  • Do your tests work with Composer 0.17.2?
  • Are you using the same PyTorch version with Composer 0.17.2 and 0.18.0? Which torch version is it?

b-chu commented on June 10, 2024

For the first error, can you post the stack trace? How are you mocking environment variables? We set defaults, so I wonder what gets logged here:

log.debug(

antoinebrl commented on June 10, 2024

Hi @b-chu,
Yes, we had been using Composer 0.17.2 so far. We have an automation bot that bumps our dependencies, so our CI caught the problem quickly. Our approach worked with any PyTorch version >= 2.0; we are currently using the latest, 2.1.2.

The environment variables are mocked using the mocker fixture from pytest-mock. When we want to pretend we are on a different node, we change RANK like so:

mocker.patch.dict("os.environ", {"RANK": "1"})

This is the stack trace once the Trainer is instantiated:

.venv/lib/python3.10/site-packages/composer/trainer/trainer.py:989: in __init__
    dist.initialize_dist(device, dist_timeout)
.venv/lib/python3.10/site-packages/composer/utils/dist.py:527: in initialize_dist
    dist.init_process_group(device_obj.dist_backend, timeout=timeout_timedelta)
.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:74: in wrapper
    func_return = func(*args, **kwargs)
.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1141: in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
.venv/lib/python3.10/site-packages/torch/distributed/rendezvous.py:236: in _env_rendezvous_handler
    world_size = int(_get_env_or_raise("WORLD_SIZE"))
.venv/lib/python3.10/site-packages/torch/distributed/rendezvous.py:216: in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set

along with this debug statement from initialize_dist:

DEBUG    composer.utils.dist:dist.py:511 Initializing torch.dist: global_rank=1, local_rank=0, world_size=1, local_world_size=1, node_rank=0

Looking at the changelog, dist.initialize_dist() was not called in previous versions because we fell into the dist.get_world_size() == 1 case:

-        if deepspeed_config is not None or fsdp_config is not None or dist.get_world_size() > 1:
-            # Deepspeed and FSDP both require torch.distributed to be initialized, even if the world size is 1
-            # And torch.distributed is always required for multi-rank training
-            dist.initialize_dist(device, dist_timeout)
+        dist.initialize_dist(device, dist_timeout)

The condition was removed to support DTensor, but this prevents us from using the Trainer without launching through the composer command.

b-chu commented on June 10, 2024

It seems like the previous setup was incorrect, but we hadn't caught that error. By patching only RANK, torch gets confused because you're telling it you're on rank 1 (i.e., the second GPU) while at the same time saying you only have one GPU (since the world size is 1).

To test a distributed setup without actually having one, you would need to mock all the communication and env vars.
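
As a rough sketch of what that would involve (assuming the pytest-mock mocker fixture; the values are illustrative, and any collective the code under test calls would still need stubbing):

# Pretend to be rank 1 of a 2-rank job by patching the env vars that the
# env:// rendezvous reads, and stub out process-group creation so no real
# rendezvous or communication is attempted.
mocker.patch.dict(
    "os.environ",
    {
        "RANK": "1",
        "LOCAL_RANK": "1",
        "WORLD_SIZE": "2",
        "LOCAL_WORLD_SIZE": "2",
        "NODE_RANK": "0",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
    },
)
mocker.patch("torch.distributed.init_process_group")
# Calls such as get_world_size, barrier, broadcast, or all_gather would also
# have to be patched to return canned values.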

b-chu commented on June 10, 2024

Removing the bug label since this isn't an issue with Composer.

mvpatel2000 commented on June 10, 2024

@antoinebrl

Apologies for the trouble here. As you pointed out, Composer now always calls dist.initialize_dist. I agree this is a bit annoying. This is because torch has a new unified API for getting/setting the model/optimizer state_dict, which for some reason requires dist to be initialized even for a single process. We are looking at upstreaming a fix to torch that does not require this. The new API comes out in torch 2.2.
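
For context, a minimal sketch of the kind of unified API being referred to (assuming torch >= 2.2; illustrative only):

import torch
from torch.distributed.checkpoint.state_dict import get_model_state_dict

model = torch.nn.Linear(4, 4)
# Unified entry point for fetching a model state dict; per the discussion above,
# it currently expects torch.distributed to be initialized even for a single
# process, which is why Composer now always calls initialize_dist.
state_dict = get_model_state_dict(model)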

Technically, this should still be valid in all scenarios, but it does cause problems if environment variables are not correctly configured. IIRC there are 3 options:

  • In this case, as @b-chu pointed out, I believe your RANK env var should be 0
  • Or, you can run composer -n 1 and it will set the env vars for you
  • Or, you can set no env vars and Composer will default-initialize for just 1 rank (see the sketch below).
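
A rough sketch of that third option inside a test, assuming pytest's built-in monkeypatch fixture:

# Clear any torch.distributed-related env vars so Composer falls back to its
# single-rank defaults when the Trainer initializes dist.
for var in (
    "RANK",
    "LOCAL_RANK",
    "WORLD_SIZE",
    "LOCAL_WORLD_SIZE",
    "NODE_RANK",
    "MASTER_ADDR",
    "MASTER_PORT",
):
    monkeypatch.delenv(var, raising=False)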

Please feel free to reach out if none of these work. For what it's worth, we use the same setup as you for unit testing in Composer and it does work fine with the above.

I will leave this issue open until you can confirm it works!

mvpatel2000 commented on June 10, 2024

FYI we have yanked this release for unrelated checkpointing reasons, but I'll see if we can avoid this in 0.18.1.
