
Comments (10)

antoinebrl commented on June 10, 2024

Thanks @b-chu and @mvpatel2000 for coming back to me and thanks for pushing a patch upstream!

Pretending to be on rank 1 by changing only RANK was enough for us to test behaviors that don't require synchronization across ranks (for example, a subclass of LoggerDestination). I tried to mock all the other torch.distributed-related variables, but it didn't work because MASTER_ADDR and MASTER_PORT point to a service that does not exist. As @b-chu suggested, communication would have to be mocked too, but that sounds non-trivial.

Looking at your Makefile, the distributed tests are launched using python -m composer.cli.launcher -n $(WORLD_SIZE) -m pytest. I think the distributed magic happens inside composer.cli.launcher, while you leverage pytest markers to skip some of the tests when you are not in a distributed setup. In our case, our CI is managed by an internal tool which calls pytest directly.
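
For reference, a minimal sketch of that kind of marker-based skip, using a generic skipif rather than Composer's actual markers or fixtures:

import os

import pytest

# Skip unless the test was started under a multi-rank launcher that exported WORLD_SIZE.
_world_size = int(os.environ.get("WORLD_SIZE", "1"))

@pytest.mark.skipif(_world_size < 2, reason="requires a distributed launch (WORLD_SIZE >= 2)")
def test_something_distributed():
    ...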

mvpatel2000 commented on June 10, 2024

@antoinebrl we are going to revert this change for the next release and work with the torch team on a better solution here. Instead, we will delay switching APIs until torch 2.3. Thanks for raising this to us!

#2899

antoinebrl commented on June 10, 2024

@mvpatel2000 sorry, I can't add the bug tag because this issue was automatically created by GitHub from a comment I made on a merged PR.

b-chu commented on June 10, 2024
  • Do your tests work with Composer 0.17.2?
  • Are you using the same PyTorch version with Composer 0.17.2 and 0.18.0? Which torch version is it?

b-chu commented on June 10, 2024

For the first error, can you post the stack trace? How are you mocking environment variables? We set defaults, so I wonder what gets logged here:

log.debug(

antoinebrl commented on June 10, 2024

Hi @b-chu,
Yes, we had been using Composer 0.17.2 so far. We have an automation bot that bumps our dependencies, so our CI caught the problem quickly. Our approach worked with any PyTorch version >= 2.0; we are currently using the latest, 2.1.2.

The environment variables are mocked using the mocker fixture from pytest-mock. When we want to pretend we are on a different node, we change RANK like so:

mocker.patch.dict("os.environ", {"RANK": "1"})

This is the stack trace once the Trainer is instantiated:

.venv/lib/python3.10/site-packages/composer/trainer/trainer.py:989: in __init__
    dist.initialize_dist(device, dist_timeout)
.venv/lib/python3.10/site-packages/composer/utils/dist.py:527: in initialize_dist
    dist.init_process_group(device_obj.dist_backend, timeout=timeout_timedelta)
.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:74: in wrapper
    func_return = func(*args, **kwargs)
.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1141: in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
.venv/lib/python3.10/site-packages/torch/distributed/rendezvous.py:236: in _env_rendezvous_handler
    world_size = int(_get_env_or_raise("WORLD_SIZE"))
.venv/lib/python3.10/site-packages/torch/distributed/rendezvous.py:216: in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set

along with this debug statement from initialize_dist:

DEBUG    composer.utils.dist:dist.py:511 Initializing torch.dist: global_rank=1, local_rank=0, world_size=1, local_world_size=1, node_rank=0

Looking at the changelog, dist.initialize_dist() was not called in previous versions because we fell into the dist.get_world_size() == 1 case:

-        if deepspeed_config is not None or fsdp_config is not None or dist.get_world_size() > 1:
-            # Deepspeed and FSDP both require torch.distributed to be initialized, even if the world size is 1
-            # And torch.distributed is always required for multi-rank training
-            dist.initialize_dist(device, dist_timeout)
+        dist.initialize_dist(device, dist_timeout)

The condition was removed to support DTensor, but this prevents us from using the Trainer without launching through the composer command.

b-chu commented on June 10, 2024

It seems like the previous setup was incorrect, but we hadn't caught that error. By patching only RANK, torch gets confused because you're telling it you're on rank 1 (i.e., the second GPU) while at the same time saying you only have one GPU (since the world size is 1).

To test a distributed setup without actually having one, you would need to mock all the communication and env vars.
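
As a rough sketch of what that would involve (assuming the pytest-mock mocker fixture; the values are illustrative, and any collective the code under test calls would still need stubbing):

# Pretend to be rank 1 of a 2-rank job by patching the env vars that the
# env:// rendezvous reads, and stub out process-group creation so no real
# rendezvous or communication is attempted.
mocker.patch.dict(
    "os.environ",
    {
        "RANK": "1",
        "LOCAL_RANK": "1",
        "WORLD_SIZE": "2",
        "LOCAL_WORLD_SIZE": "2",
        "NODE_RANK": "0",
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": "29500",
    },
)
mocker.patch("torch.distributed.init_process_group")
# Calls such as get_world_size, barrier, broadcast, or all_gather would also
# have to be patched to return canned values.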

b-chu commented on June 10, 2024

Removing the bug label since this isn't an issue with Composer.

mvpatel2000 commented on June 10, 2024

@antoinebrl

Apologies for the trouble here. As you pointed out, Composer now always calls dist.initialize_dist. I agree this is a bit annoying. This is because torch has a new unified API for getting/setting the model/optimizer state_dict, which for some reason requires dist to be initialized even for a single process. We are looking at upstreaming a fix to torch that does not require this. The new API comes out in torch 2.2.
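
For context, a minimal sketch of the kind of unified API being referred to (assuming torch >= 2.2; illustrative only):

import torch
from torch.distributed.checkpoint.state_dict import get_model_state_dict

model = torch.nn.Linear(4, 4)
# Unified entry point for fetching a model state dict; per the discussion above,
# it currently expects torch.distributed to be initialized even for a single
# process, which is why Composer now always calls initialize_dist.
state_dict = get_model_state_dict(model)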

Technically, this should still be valid in all scenarios, but it does cause problems if environment variables are not correctly configured. IIRC there are 3 options:

  • In this case, as @b-chu pointed out, I believe your RANK env var should be 0
  • Or, you can run composer -n 1 and it will set the env vars for you
  • Or, you can set no env vars and Composer will default-initialize for just 1 rank (see the sketch below).
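
A rough sketch of that third option inside a test, assuming pytest's built-in monkeypatch fixture:

# Clear any torch.distributed-related env vars so Composer falls back to its
# single-rank defaults when the Trainer initializes dist.
for var in (
    "RANK",
    "LOCAL_RANK",
    "WORLD_SIZE",
    "LOCAL_WORLD_SIZE",
    "NODE_RANK",
    "MASTER_ADDR",
    "MASTER_PORT",
):
    monkeypatch.delenv(var, raising=False)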

Please feel free to reach out if none of these work. For what it's worth, we use the same setup as you for unit testing in Composer and it does work fine with the above.

I will leave this issue open until you can confirm it works!

mvpatel2000 commented on June 10, 2024

FYI we have yanked this release for unrelated checkpointing reasons, but I'll see if we can avoid this in 0.18.1.
