pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

Home Page: https://pytorch.org/torchx

License: Other

Languages: Python 99.31%, Shell 0.66%, Dockerfile 0.03%
Topics: pytorch, machine-learning, kubernetes, slurm, distributed-training, pipelines, components, deep-learning, python, aws-batch

torchx's Introduction


TorchX

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

TorchX currently supports:

  • Kubernetes (EKS, GKE, AKS, etc)
  • Slurm
  • AWS Batch
  • Docker
  • Local
  • Ray (prototype)
  • GCP Batch (prototype)

Need a scheduler not listed? Let us know!

Quickstart

See the quickstart guide.

Documentation

Requirements

torchx:

  • python3 (3.8+)
  • PyTorch
  • optional: Docker (needed for Docker-based schedulers)

Certain schedulers may require scheduler specific requirements. See installation for info.

Installation

Stable

# install torchx sdk and CLI -- minimum dependencies
pip install torchx

# install torchx sdk and CLI -- all dependencies
pip install "torchx[dev]"

# install torchx kubeflow pipelines (kfp) support
pip install "torchx[kfp]"

# install torchx Kubernetes / Volcano support
pip install "torchx[kubernetes]"

# install torchx Ray support
pip install "torchx[ray]"

# install torchx GCP Batch support
pip install "torchx[gcp_batch]"

Nightly

# install torchx sdk and CLI
pip install "torchx-nightly[dev]"

Source

# install torchx sdk and CLI from source
$ pip install -e git+https://github.com/pytorch/torchx.git#egg=torchx

# install extra dependencies
$ pip install -e git+https://github.com/pytorch/torchx.git#egg=torchx[dev]

Docker

TorchX provides a Docker container for use as part of a TorchX role.

See: https://github.com/pytorch/torchx/pkgs/container/torchx

Contributing

We welcome PRs! See the CONTRIBUTING file.

License

TorchX is BSD licensed, as found in the LICENSE file.

torchx's People

Contributors

aivanou, amyreese, andywag, asuta274, ccharest93, clumsy, d4l3k, daniellepintz, grievejia, hakanardo, ishachirimar, jiayongmeta, kiukchung, kpostoffice, kunalb, kurman, lena-kashtelyan, manav-a, michaelclifford, mpolson64, msaroufim, ntlm1686, priyaramani, sara-ks, schmidt-ai, scottilee, stroxler, tangbinh, yikaimeta, yukunlin


torchx's Issues

[torchx/config] Enable scheduler_params override to .torchxconfig

Description

Enable scheduler_params (these are ctor params to the scheduler objects, NOT the run configs) to be set via the .torchxconfig file.

Motivation/Background

We allow users to set the default scheduler runcfg in .torchxconfig (see: https://pytorch.org/torchx/latest/experimental/runner.config.html). The ask is to also enable setting scheduler_params (see example: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/local_scheduler.py#L925) via .torchxconfig.

Detailed Proposal

Add a section as [{scheduler_name}.params]:

[local_cwd.params]
cache_size = 10
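
Below is a minimal sketch of how such a section could be parsed, assuming stdlib configparser and the proposed [{scheduler_name}.params] naming; the helper name is illustrative, not the actual torchx implementation:

import configparser
from typing import Any, Dict

def load_scheduler_params(path: str, scheduler_name: str) -> Dict[str, Any]:
    # returns the ctor params declared under [{scheduler_name}.params], if any
    config = configparser.ConfigParser()
    config.read(path)
    section = f"{scheduler_name}.params"
    if not config.has_section(section):
        return {}
    # note: configparser returns strings; casting is left to the scheduler ctor
    return dict(config.items(section))

# e.g. load_scheduler_params(".torchxconfig", "local_cwd") -> {"cache_size": "10"}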

Alternatives

N/A

Additional context/links

N/A

kubernetes_scheduler: volcano doesn't support Kubernetes 1.22

๐Ÿ› Bug

The Kubernetes hello world does not run; instead it returns a KeyError when requesting the description, and I don't think it ever launches (it does not appear in jobs). There is also a typo in the example (torchx/schedulers/kubernetes_scheduler:ln245): it should be --scheduler_args, not --scheduler_opts.

kubectl get jobs -A
NAMESPACE        NAME                     COMPLETIONS   DURATION   AGE
volcano-system   volcano-admission-init   1/1           3s         115m

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

  1. Install Kubernetes 1.22 and gpu-operator (following https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#step-0-before-you-begin) (I also added one extra node to the cluster)
  2. Install Volcano kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.3.0/installer/volcano-development.yaml
  3. git clone https://github.com/pytorch/torchx.git && cd torchx && sudo python3 -m pip install -e . (Kubernetes not on pypi yet)
  4. torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=test utils.echo --msg hello
torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=test utils.echo --msg hello
kubernetes://torchx_monash/default:echo-8mgdh
=== RUN RESULT ===
Launched app: kubernetes://torchx_monash/default:echo-8mgdh
Traceback (most recent call last):
  File "/usr/local/bin/torchx", line 33, in <module>
    sys.exit(load_entry_point('torchx', 'console_scripts', 'torchx')())
  File "/home/monash/torchx/torchx/cli/main.py", line 62, in main
    args.func(args)
  File "/home/monash/torchx/torchx/cli/cmd_run.py", line 120, in run
    status = runner.status(app_handle)
  File "/home/monash/torchx/torchx/runner/api.py", line 294, in status
    desc = scheduler.describe(app_id)
  File "/home/monash/torchx/torchx/schedulers/kubernetes_scheduler.py", line 342, in describe
    status = resp["status"]
KeyError: 'status'

Expected behavior

Return without error and can retrieve status or description of job.

Environment

  • torchx version (e.g. 0.1.0rc1): master
  • Python version: 3.8.10
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed torchx (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Git commit (if installed from source): 037e716
  • Execution environment (on-prem, AWS, GCP, Azure etc): on-prem
  • Any other relevant information:

Additional context

My use case is 4 workstations with 2 GPUs each, doing distributed or shared training among a small university research group. I'm trying to get this hello world working before I start running my distributed code, which I have working on a single-node system with torch.distributed.run --standalone (args...).

[local_docker] Distributed jobs do not work with default config

Distributed jobs with the local_docker scheduler do not work. Each Docker container gets allocated to the default bridge network, which makes the containers unable to communicate with each other.
Also, no name or hostname is assigned to each Docker container. This means that c10d rendezvous cannot be used, since we cannot provide a known master address.

Proposal

The proposal is to add a new network option to the local scheduler runopts. If set, this option will be passed to Docker as the --net parameter. This makes sure that all containers launched by local_docker run on the same network, with all ports reachable between them.

Add --name and --hostname to the docker command with the value ${APP_NAME}-${ROLE_NAME}-${REPLICA_ID}.
This makes sure that each Docker container has a unique hostname that can be referenced on the command line, e.g.:


torchx run -s local_docker \
 ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
--nnodes 2 --nproc_per_node 2 \
--rdzv_backend c10d \
--rdzv_endpoint cv-trainer-worker-0:29500
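
For illustration, a rough sketch of what the proposed behavior could look like using the Docker Python SDK (docker-py); the network name, image, and command below are placeholders, not the actual local_docker implementation:

import docker

client = docker.from_env()

# one shared user-defined bridge network so replicas can resolve each other by
# hostname (the default bridge network does not provide DNS between containers)
client.networks.create("torchx", driver="bridge")

app_name, role_name = "cv-trainer", "worker"
for replica_id in range(2):
    hostname = f"{app_name}-{role_name}-{replica_id}"
    client.containers.run(
        "ghcr.io/pytorch/torchx:0.1.1dev0",  # placeholder image
        name=hostname,      # equivalent of `docker run --name`
        hostname=hostname,  # equivalent of `docker run --hostname`
        network="torchx",   # equivalent of `docker run --net torchx`
        detach=True,
        command=["python", "-c", "print('replica placeholder')"],
    )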

[torchx/spec] Make docstring optional for components.

Description

In my component function, make docstring optional. So that I can define components as:

def echo(
    msg: str = "hello world", image: str = TORCHX_IMAGE, num_replicas: int = 1
) -> specs.AppDef:
    return specs.AppDef(
        name="echo",
        roles=[
            specs.Role(
                name="echo",
                image=image,
                entrypoint="/bin/echo",
                args=[msg],
                num_replicas=num_replicas,
            )
        ],
    )

instead of the current

def echo(
    msg: str = "hello world", image: str = TORCHX_IMAGE, num_replicas: int = 1
) -> specs.AppDef:
    """
    Echos a message to stdout (calls /bin/echo)

    Args:
        msg: message to echo
        image: image to use
        num_replicas: number of replicas to run

    """
    return specs.AppDef(
        name="echo",
        roles=[
            specs.Role(
                name="echo",
                image=image,
                entrypoint="/bin/echo",
                args=[msg],
                num_replicas=num_replicas,
            )
        ],
    )

Motivation/Background

Currently torchx uses the component function's docstring to auto-create an argparse.ArgumentParser for the component function. This is only useful in the context of the CLI, where we pass through component args from the CLI as (in the case of echo above):

$ torchx run utils.echo --help # <- prints the help string of def echo() not torchx run
$ torchx run utils.echo --msg "foobar" <-- --msg is passed as "msg" to echo(msg)

While it is good practice to add docstrings to components (especially if they are going to be shared), it makes testing things out with torchx extremely tedious. Moreover, the docstring is not used at all when running the component programmatically, since the component itself is just a python function that the user calls as they would any other function.

Detailed Proposal

The proposal is to make the docstring optional and change file_linter.py to parse the name, type, and default of the component's arguments from the function signature rather than the docstring (the current behavior).

What we lose in this case is the help message, so running:

$ torchx run utils.echo --help

would simply return some canned placeholders instead of meaningful help strings for each argument.

I also propose that the placeholders just be of the form f"({arg_type})" as such:

usage: torchx run ...torchx_params... echo  [-h] [--msg MSG] [--image IMAGE] [--num_replicas NUM_REPLICAS]

optional arguments:
  -h, --help            show this help message and exit
  --msg MSG             (str)
  --image IMAGE         (str)
  --num_replicas NUM_REPLICAS
                        (int)

if the doc string were present then this would look like

usage: torchx run ...torchx_params... echo  [-h] [--msg MSG] [--image IMAGE] [--num_replicas NUM_REPLICAS]

App spec: Echos a message to stdout (calls /bin/echo)

optional arguments:
  -h, --help            show this help message and exit
  --msg MSG             message to echo
  --image IMAGE         image to use
  --num_replicas NUM_REPLICAS
                        number of replicas to run

which is more informative but not necessary to run correctly.
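
As a minimal sketch of the signature-based parsing, assuming only stdlib inspect and argparse (parser_from_signature is a hypothetical helper, not the actual file_linter.py change):

import argparse
import inspect
from typing import Callable

def parser_from_signature(fn: Callable) -> argparse.ArgumentParser:
    # build the parser from the function signature instead of the docstring
    parser = argparse.ArgumentParser(prog=fn.__name__)
    for name, param in inspect.signature(fn).parameters.items():
        arg_type = param.annotation if param.annotation is not inspect.Parameter.empty else str
        default = None if param.default is inspect.Parameter.empty else param.default
        parser.add_argument(
            f"--{name}",
            type=arg_type,
            default=default,
            # canned placeholder of the form "(type)" when there is no docstring
            help=f"({getattr(arg_type, '__name__', arg_type)})",
        )
    return parser

# e.g. parser_from_signature(echo).parse_args(["--msg", "hi"])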

Alternatives

N/A

Additional context/links

Improve docs page toctree index

📚 Documentation

Link

https://pytorch.org/torchx

What does it currently say?

No issues with the documentation itself; this calls for a revamped indexing of the toctree on the torchx docs page.

What should it say?

Make the toctree be:

  1. Usage:

    • Basic Concepts
    • Installation
    • 10 Min Tutorial (Hello World should be renamed to this)
  2. Examples:

    • Application
    • Component -> (links to) list of builtins (4. below)
    • Pipelines
  3. Best Practices:

    • Application
    • Component
  4. Application (Runtime)

    • Overview
    • HPO
    • Tracking
  5. Components

    • Train
    • Distributed
    • ...
  6. Runner (Schedulers)

    • Localhost
    • Kubernetes
    • Slurm
  7. Pipelines

    • Kubeflow
  8. API

    • torchx.specs
    • torchx.runner
    • torchx.schedulers
    • torchx.pipelines
  9. Experimental

    • (beta) torchx.config

Why?

The proposed toctree is a better organization compared to the one we have today. It better organizes the parallels between apps, components, and pipelines, and logically lays out the sections so the page reads better top to bottom.

torchx run --scheduler local utils.echo --msg "hello world" crashes

๐Ÿ› Bug

To Reproduce

Steps to reproduce the behavior:

  1. conda create -n torchx python=3.8
  2. pip install torchx
  3. torchx run --scheduler local utils.echo --help -> works fine
  4. torchx run --scheduler local utils.echo --msg "hello world" -> Crashes
{"session": "", "scheduler": "local", "api": "schedule", "app_id": null, "runcfg": "{\"image_type\": \"dir\", \"log_dir\": null}", "raw_exception": "Traceback (most recent call last):\n  File \"/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/runner/api.py\", line 202, in schedule\n    app_id = sched.schedule(dryrun_info)\n  File \"/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/schedulers/local_scheduler.py\", line 543, in schedule\n    replica = self._popen(role_name, replica_id, replica_params)\n  File \"/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/schedulers/local_scheduler.py\", line 485, in _popen\n    proc = subprocess.Popen(\n  File \"/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/subprocess.py\", line 858, in __init__\n    self._execute_child(args, executable, preexec_fn, close_fds,\n  File \"/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/subprocess.py\", line 1705, in _execute_child\n    raise child_exception_type(err_msg)\nsubprocess.SubprocessError: Exception occurred in preexec_fn.\n", "source": "<unknown>"}
Traceback (most recent call last):
  File "/Users/marksaroufim/miniconda3/envs/torchx/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/cli/main.py", line 57, in main
    args.func(args)
  File "/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/cli/cmd_run.py", line 93, in run
    result = runner.run_component(
  File "/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/runner/api.py", line 142, in run_component
    return self.run(app, scheduler, cfg)
  File "/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/runner/api.py", line 164, in run
    return self.schedule(dryrun_info)
  File "/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/runner/api.py", line 202, in schedule
    app_id = sched.schedule(dryrun_info)
  File "/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/schedulers/local_scheduler.py", line 543, in schedule
    replica = self._popen(role_name, replica_id, replica_params)
  File "/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/site-packages/torchx/schedulers/local_scheduler.py", line 485, in _popen
    proc = subprocess.Popen(
  File "/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/Users/marksaroufim/miniconda3/envs/torchx/lib/python3.8/subprocess.py", line 1705, in _execute_child
    raise child_exception_type(err_msg)
subprocess.SubprocessError: Exception occurred in preexec_fn.

Expected behavior

The command should not crash.

Environment

  • torchx version (e.g. 0.1.0rc1): Latest
  • Python version: 3.8
  • OS (e.g., Linux): Mac OS X
  • How you installed torchx (conda, pip, source, docker): conda
  • Docker image and tag (if using docker):
  • Git commit (if installed from source): master
  • Execution environment (on-prem, AWS, GCP, Azure etc): Local
  • Any other relevant information:

[torchx.runner] Support Kubernetes

Description

Implement a k8s_scheduler for the torchx.runner to allow launching torchx.AppDefs as a standalone job on Kubernetes.

Motivation/Background

Currently torchx-0.1.0.dev only has support for launching onto a "local" scheduler:

# only this is supported
torchx run --scheduler local ...

This is only useful for local validation and testing; we need a way to run components on a real scheduler.

NOTE: Running on Kubeflow Pipelines is supported for pipelines; we are specifically talking about launching as a job (not a workflow/pipeline).

Detailed Proposal

Add k8s support so that this is possible:

# CLI
torchx run --scheduler k8s ...

Alternatives

N/A

Additional context/links

N/A

[file-linter] Refactor file-linter and components-finder

The file linter and components finder modules reside under the torchx.specs module. The components finder module has a runtime dependency on the torchx.components module: during find execution, the components finder dynamically loads the module and all its submodules to search for components. This creates a circular dependency between torchx.specs and torchx.components, since torchx.components depends on torchx.specs.

We need to move these modules out of torchx.specs and extract the hardcoded module search parameter from the components finder.

[torchx.runner] Add support for SLURM

Description

Add support for a slurm_scheduler for torchx runner (not pipelines!)

Motivation/Background

torchx-0.1.0.dev only supports running locally.

Detailed Proposal

Should be able to run a component as a standalone SLURM job as such:

# CLI
torchx run --scheduler slurm ...

Alternatives

N/A

Additional context/links

N/A

[torchx/examples] Fix classy vision trainer example on local, multi-node, gpu

๐Ÿ› Bug

https://github.com/pytorch/torchx/blob/main/torchx/examples/apps/lightning_classy_vision/train.py
does not work on the local scheduler with multiple nodes when GPUs are available (it works on CPUs, however).

This is because the local scheduler does NOT mask GPUs by setting CUDA_VISIBLE_DEVICES for each
replica, which is expected behavior since the local scheduler DOES NOT do any type of resource isolation. This makes the example CV trainer incompatible with local_cwd (or any other variant of the local scheduler) when running multiple replicas.

Interestingly, you can still work around it by specifying --nnodes=1 --nproc_per_node=8 instead of --nnodes=2 --nproc_per_node=4 or --nnodes=4 --nproc_per_node=2.

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

ON A HOST WITH GPU

# note: any --nnodes greater than 1 reproduces the issue
torchx run -s local_cwd ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
  --nnodes 2 \
  --nproc_per_node 2 \
  --rdzv_backend c10d \
  --rdzv_endpoint localhost:29500

Expected behavior

Should work

Environment

  • torchx version (e.g. 0.1.0rc1):
  • Python version:
  • OS (e.g., Linux):
  • How you installed torchx (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context

cli: support fetching logs from all roles

Description

Currently you have to specify which role you want to fetch logs for when using torchx log. Ideally you could just specify the job name to fetch all of them.

torchx log kubernetes://torchx_tristanr/default:sh-hxkkr/sh

Motivation/Background

This reduces friction for users with single-role jobs when trying to fetch logs. It's very common that I forget to add the role and then have to run the command again with it. There's no technical limitation here, and it removes friction for the user.

Detailed Proposal

This would require updating the log CLI to support iterating over all roles and fetching logs from all the replicas. https://github.com/pytorch/torchx/blob/master/torchx/cli/cmd_log.py#L81

This doesn't require any changes to the scheduler implementations and is purely a CLI improvement.
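
A small sketch of the proposed iteration, where the runner method names mirror the description above and should be treated as assumptions rather than the exact torchx API:

def print_all_logs(runner, app_handle: str) -> None:
    # describe() resolves the job's roles and their replica counts
    desc = runner.describe(app_handle)
    for role in desc.roles:
        for replica_id in range(role.num_replicas):
            # fetch and prefix each replica's log stream, role by role
            for line in runner.log_lines(app_handle, role.name, k=replica_id):
                print(f"{role.name}/{replica_id} {line}", end="")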

Alternatives

We could instead change the CLI to automatically select the role when there's only one role in a job. That would improve the UX a fair bit while also preventing tons of log spam for complex jobs.

Additional context/links

[RFC] A collect_env.py script to automatically get system diagnostics

Description

A script to get and print diagnostic information for users who are filing TorchX bug reports.

Motivation/Background

PyTorch has a script that users can run to copy-paste system diagnostics info into the bug report issues they file with pytorch. We should have something similar for TorchX.

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py

Detailed Proposal

Write a collect_env.py for TorchX that outputs relevant info (non-exhaustive list below) that will help us accurately debug user issues:

  1. torchx version
  2. kfp, slurm (scheduler) version
  3. docker version
  4. platform
  5. which CSP (aws, gcp, azure or on-prem) - we might not be able to collect this automatically, instead this might just be a user input

NOTE: we can actually expand on this idea and write a "diagnostics job" that users can launch that returns information about cluster setup (e.g. that the network settings are correctly configured by trying to connect to nodes in the distributed job).
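
A minimal sketch of what such a script could look like, assuming only the stdlib; the exact fields printed are illustrative:

import platform
import shutil
import subprocess
from importlib import metadata

def version_of(pkg: str) -> str:
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return "not installed"

def main() -> None:
    print(f"torchx version : {version_of('torchx')}")
    print(f"kfp version    : {version_of('kfp')}")
    print(f"python version : {platform.python_version()}")
    print(f"platform       : {platform.platform()}")
    docker = shutil.which("docker")
    if docker:
        out = subprocess.run([docker, "--version"], capture_output=True, text=True)
        print(f"docker version : {out.stdout.strip()}")

if __name__ == "__main__":
    main()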

Alternatives

Instead of a script, have the user fill out a template manually (not optimal since it's inconvenient and cumbersome, and also leads to human errors (typos, etc.)).

Additional context/links

see pytorch's bug report template: https://github.com/pytorch/pytorch/issues/new?assignees=&labels=&template=bug-report.md

cli: fails to find malformed components

๐Ÿ› Bug

If you try to load a component without a fully correct definition, the CLI will throw an error about not being able to find it. In those cases we should indicate to the user that the component was found but its definition is incorrect, along with the errors that were found.

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

  1. comment out one of the arg descs in components/utils.py.

def echo(
    msg: str = "hello world", image: str = "/tmp", num_replicas: int = 1
) -> specs.AppDef:
    """
    Echos a message to stdout (calls /bin/echo)

    Args:
        msg: message to echo
        #image: image to use
        num_replicas: number of replicas to run

    """
    return specs.AppDef(
        name="echo",
        roles=[
            specs.Role(
                name="echo",
                image=image,
                entrypoint="/bin/echo",
                args=[msg],
                num_replicas=num_replicas,
            )
        ],
    )
  2. try to run it:
tristanr@tristanr-arch2 ~/D/torchx (dockerver)> torchx run --scheduler local utils.echo
Traceback (most recent call last):
  File "/home/tristanr/.local/bin/torchx", line 33, in <module>
    sys.exit(load_entry_point('torchx', 'console_scripts', 'torchx')())
  File "/home/tristanr/Developer/torchx/torchx/cli/main.py", line 62, in main
    args.func(args)
  File "/home/tristanr/Developer/torchx/torchx/cli/cmd_run.py", line 97, in run
    result = runner.run_component(
  File "/home/tristanr/Developer/torchx/torchx/runner/api.py", line 135, in run_component
    raise ValueError(
ValueError: Component `utils.echo` not found. Please make sure it is one of the builtins: `torchx builtins`. Or registered via `[torchx
.components]` entry point (see: https://pytorch.org/torchx/latest/configure.html)

Expected behavior

This should print the errors in the component instead of printing a not found error.

Environment

  • torchx version (e.g. 0.1.0rc1): master
  • Python version: 3.9
  • OS (e.g., Linux): Linux
  • How you installed torchx (conda, pip, source, docker): source
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context

[local_docker] Faulty termination logic for local docker scheduler

The local scheduler has a single termination behavior: send SIGTERM and/or SIGKILL to the processes that it manages.
This does not work with DockerImageProvider. When we use Docker we have to use docker stop $container_name to actually stop the container. If we just SIGTERM the spawned docker process, it causes undefined behavior. In some cases, it actually messes up the terminal that launched the job.

Proposal

There are several ways to solve the issue:

  1. Make a new LocalDockerScheduler that redefines the necessary methods
  2. Add a new _LocalAppDef that overrides the kill method with a docker-specific implementation
  3. Add logic in _LocalAppDef.kill and make it scheduler-based

Installation from source examples fail

📚 Documentation

Link

https://github.com/pytorch/torchx#source

What does it currently say?

# install torchx sdk and CLI from source
$ pip install -e git+https://github.com/pytorch/torchx.git

What should it say?

No idea.

Why?

On Linux:

(venv) sbyan % pip --version
pip 21.2.4 from /mnt/shared_ad2_mt1/sbyan/git/PrivateFederatedLearning/venv/lib64/python3.6/site-packages/pip (python 3.6)
(venv) sbyan % pip install -e git+https://github.com/pytorch/torchx.git
ERROR: Could not detect requirement name for 'git+https://github.com/pytorch/torchx.git', please specify one with #egg=your_package_name

On MacOS:

(venv) smb % pip --version
pip 21.2.4 from /Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/pip (python 3.9)
(venv) smb % pip install -e git+https://github.com/pytorch/torchx.git
ERROR: Could not detect requirement name for 'git+https://github.com/pytorch/torchx.git', please specify one with #egg=your_package_name

Update builtin components to use best practices + documentation

Before the stable release we want to do some general cleanups on the current builtin components.

  • all components should default to docker images (no /tmp)
  • all components should use python -m entrypoints to make it easier to support all environments by using python's resolution system
  • update the component best practice documentation to indicate above

Slurm image handling will be revisited later to make it easier to deal with virtualenvs and the local paths.

[torchx][api] Deprecate Torchx Session Name

The TorchX Runner has a concept of a session_name. When a user creates a Runner, they may provide a session_name. If no argument is provided, the session_name will be set to torchx_$user.

This parameter is then used to create a full job name as session_name-$app_def.name.

The problem is that the default session name torchx_$user is not bounded by a max length, and in theory can be very long. This may affect user jobs without them understanding the root cause.

In order to solve this, we can make the default session name torchx_ instead, omitting the $user part.

run --scheduler local utils.echo --mesg fails using commit 8486264c

๐Ÿ› Bug

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • [x] torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

  1. pip install -e git+https://github.com/pytorch/torchx.git#egg=torchx
  2. torchx run --scheduler local utils.echo --msg "Hellow world"
(venv) smb % pip install -e git+https://github.com/pytorch/torchx.git#egg=torchx 
Obtaining torchx from git+https://github.com/pytorch/torchx.git#egg=torchx
  Cloning https://github.com/pytorch/torchx.git to ./venv/src/torchx
  Running command git clone -q https://github.com/pytorch/torchx.git /Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx
  Resolved https://github.com/pytorch/torchx.git to commit 8486264c622b9d6d04bbb23e3c0b2387b24f5a46
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Requirement already satisfied: pyre-extensions in ./venv/lib/python3.9/site-packages (from torchx) (0.0.22)
Requirement already satisfied: docstring-parser==0.8.1 in ./venv/lib/python3.9/site-packages (from torchx) (0.8.1)
Requirement already satisfied: pyyaml in ./venv/lib/python3.9/site-packages (from torchx) (5.4.1)
Requirement already satisfied: typing-inspect in ./venv/lib/python3.9/site-packages (from pyre-extensions->torchx) (0.6.0)
Requirement already satisfied: typing-extensions in ./venv/lib/python3.9/site-packages (from pyre-extensions->torchx) (3.10.0.0)
Requirement already satisfied: mypy-extensions>=0.3.0 in ./venv/lib/python3.9/site-packages (from typing-inspect->pyre-extensions->torchx) (0.4.3)
Installing collected packages: torchx
  Running setup.py develop for torchx
Successfully installed torchx-0.1.0rc0
(venv) smb % torchx run --scheduler local utils.echo --help              
usage: torchx run ...torchx_params... echo  [-h] [--msg MSG] [--image IMAGE]
                                            [--num_replicas NUM_REPLICAS]

App spec: Echos a message to stdout (calls /bin/echo)

optional arguments:
  -h, --help            show this help message and exit
  --msg MSG             message to echo
  --image IMAGE         image to use
  --num_replicas NUM_REPLICAS
                        number of replicas to run
(venv) smb % torchx run --scheduler local utils.echo --msg "Hellow world"
Traceback (most recent call last):
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/bin/torchx", line 33, in <module>
    sys.exit(load_entry_point('torchx', 'console_scripts', 'torchx')())
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/cli/main.py", line 62, in main
    args.func(args)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/cli/cmd_run.py", line 157, in run
    self._run(runner, args)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/cli/cmd_run.py", line 118, in _run
    result = runner.run_component(
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/runner/api.py", line 161, in run_component
    return self.run(app, scheduler, cfg)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/runner/api.py", line 179, in run
    dryrun_info = self.dryrun(app, scheduler, cfg)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/runner/api.py", line 269, in dryrun
    dryrun_info = sched.submit_dryrun(app, cfg or RunConfig())
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/schedulers/api.py", line 125, in submit_dryrun
    dryrun_info = self._submit_dryrun(app, resolved_cfg)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/schedulers/local_scheduler.py", line 648, in _submit_dryrun
    request = self._to_popen_request(app, cfg)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/schedulers/local_scheduler.py", line 687, in _to_popen_request
    img_root = image_provider.fetch(role.image)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/src/torchx/torchx/schedulers/local_scheduler.py", line 141, in fetch
    raise ValueError(
ValueError: Invalid image name: ghcr.io/pytorch/torchx:0.1.0rc0, does not exist or is not a directory

Expected behavior

I expected the command to output the string Hellow world

Environment

  • torchx version (e.g. 0.1.0rc1): 0.1.0rc0
  • Python version: 3.9.6
  • OS (e.g., Linux): MacOS 10.15.7
  • How you installed torchx (conda, pip, source, docker): pip install -e git+https://github.com/pytorch/torchx.git#egg=torchx
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc): on-prem
  • Any other relevant information:

Additional context

[torchx.component] Implement an HPO component + app

Description

Support HPO as a builtin component.

Motivation/Background

Detailed Proposal

The HPO app needs to be implemented (TODO -> attach a detailed design doc RFC), and an appropriate AppDef should be authored and placed in torchx.components.hpo so that it is available as a builtin.

The HPO optimizer itself can be one of: botorch (https://botorch.org/) or ax (https://ax.dev/).

Alternatives

N/A

Additional context/links

N/A

Add Torchx Validate command

Description

TorchX allows users to develop their own components. A TorchX component is defined as a python function with several restrictions, as described in https://pytorch.org/torchx/latest/quickstart.html#defining-your-own-component

The torchx validate cmd will help users develop components.

torchx validate ~/my_component.py:func checks whether the component is valid or not. If the component is not valid, the command will print a detailed message explaining what is wrong with the function.

[RFC] TorchX 0.1stable Features

Purpose: share the current list of features for torchx-0.1stable.

IMPORTANT: features presented here are subject to change or may get pushed to future releases.

Features are grouped into several categories:

  1. Platform: schedulers, pipelines, and infra that TorchX supports.
  2. Component: builtin components/apps
  3. Stack: TorchX core features
  4. Experimental: Beta features

Platform

  1. Support the following schedulers for torchx.runner

    1. Localhost (already available in 0.1rc)
    2. SLURM (already available in 0.1rc)
    3. Kubernetes (already available in 0.1rc)
  2. Support the following pipeline platforms for torchx.pipelines

    1. Kubeflow Pipelines (already available in 0.1rc)
    2. TBD (potential candidate Apache Airflow)

Component

  1. Support the following built-in components:

    1. Serve - deploy model to torchserve (already available in 0.1rc)
    2. DDP - distributed data parallel (already available in 0.1rc)
      • note: only supported on select schedulers in rc
    3. Hyperparameter Sweep
    4. Data Preprocessing (will be pushed to 0.2)
  2. Support the following components as examples:

    • note: these are components that require heavy enough user customization that it does not make sense to offer them as builtins
    • please refer to our examples page for references to the components
      already available in 0.1rc
    1. Data Importer - simple "download from web save to s3" available in 0.1rc
    2. Training - classy vision + pytorch_lightning trainer available in 0.1rc
    3. Fine-tuning - offered as part of the trainer example above
    4. Model Evaluation - model interpretation w/ Captum available in 0.1rc

Stack

  1. torchx.tracker module (pending design RFC)
    • better handling of input/outputs from components beyond CMD style args
    • integrate into the platform's existing metadata/input/output tracking APIs
    • use for tagging RAI metadata

Experimental

N/A - as of yet

components: tensorboard component

Description

It would be nice to have a tensorboard component that could be used either as a mixin (added as a new role) or standalone. This would make it easy to launch a job and monitor it while it's running.

Detailed Proposal

Add a new builtin component tensorboard to components/metrics.py. This would provide a component with the interface:

def tensorboard(logdir: str, duration: float, image: str = "<image>") -> specs.AppDef:
   """
   Args:
      logdir: remote path to the event files to serve
      duration: number of hours to run the container for
      image: container image that provides tensorboard
   """
   ...

Lifetime

There's a bit of consideration here on how to manage the lifetime of the tensorboard role. Ideally it would be tied to the other containers, but practically we can't support that on most schedulers. Launching it as a standalone component with a fixed duration (e.g. 8 hours) is likely going to be the best supported and should be good enough. Tensorboard is quite lightweight, so having it run longer than necessary shouldn't be a big deal.

There may be better ways of handling this though. Volcano allows for flexible policies, and we could allow for containers that get killed on the first successful (0 exit code) replica.

It also could be good to watch a specific file. tensorboard uses a remote path, so we could add a watch_file arg with a specific path that the manager can long-poll on to detect shutdown. The app would have to know to write out a foo://bar/done or foo://bar/model.pt that the component can poll on for termination purposes.
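
A small sketch of the watch_file idea, assuming fsspec for the remote path; the helper name and poll interval are illustrative:

import time

import fsspec

def wait_for_file(url: str, poll_interval_s: float = 30.0) -> None:
    # long-poll the remote path until the training job writes the marker file
    fs, path = fsspec.core.url_to_fs(url)
    while not fs.exists(path):
        time.sleep(poll_interval_s)

# e.g. run tensorboard in a subprocess, then shut down once training finishes:
# wait_for_file("s3://bucket/run-1/done")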

fsspec

One other pain point is that tensorboard uses its own filesystem interface that has relatively few implementations. It is extensible, but other components use fsspec, which could cause confusion for users.

There is an issue about this on tensorboard, but it's quite new: tensorflow/tensorboard#5165

We could write our own fsspec tensorboard adapter if necessary and provide it as part of a custom docker image.

Docker images

There's not a specific docker image we can use to provide tensorboard right now. It's possible to use tensorflow/tensorflow, but that doesn't contain boto3, so there's no s3 support or other file systems. We may want to provide our own cut-down tensorboard container that can be used with the component.

Role

We also want to provide tensorboard as a role so you can have it run as a companion to the main training job. You can then easily include the tensorboard role as an extra role in your AppDef and use it as is.

Alternatives

Currently you can launch tensorboard via the KFP UI or via the command line. This requires an extra step, and in the case of KFP you can only do that after the job has run.

Additional context/links

[kubernetes] Pods with wrong image getting stuck forever

๐Ÿ› Bug

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

When a job is started with an image that does not exist, it gets stuck forever. We need to provide a better experience by propagating the errors to users via torchx.

Repro:

torchx run --scheduler kubernetes --scheduler_args namespace=default,queue=default examples/apps/dist_cifar/component.py:trainer --image dummy_image --rdzv_backend=etcd-v2 --rdzv_endpoint=etcd-server:2379 --nnodes 2 -- --epochs 1 --output_path s3://torchx-test/aivanou


torchx status $job

The second command will always show pending.

[torchx/docker] update base image to pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

Description

Update base image in https://github.com/pytorch/torchx/blob/main/torchx/runtime/container/Dockerfile#L1 from
pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime to pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime.

Motivation/Background

We currently use the 1.9.0 base image because we've been running our CI with the assumption of cuda 10.2 (we don't explicitly require cuda 10.2; we've always built the docker image this way, hence it's an "assumption" not a requirement).

Detailed Proposal

See description

Alternatives

N/A

Additional context/links

N/A

separate .torchxconfig for fb/ and oss

Description

We want to have an FB-internal .torchxconfig file to specify scheduler_args for the internal cluster, and an OSS .torchxconfig file to run on public clusters.

Motivation/Background

Detailed Proposal

Alternatives

Additional context/links

[docs] add context/intro to each docs page

📚 Documentation

Link

Ex: https://pytorch.org/torchx/main/basics.html

and some other pages

What does it currently say?

The page doesn't currently have an intro about what it covers and how it fits in context; it just jumps right into the documentation.

What should it say?

Why?

We got some good feedback from the documentation folks about adding context to each page, so that if someone gets linked to it they're not totally lost. This matches some of the user feedback we've received, so it would be good to update this.

[torchx/cli] Fix mis-indented multi-line info log for `torchx run`

๐Ÿ› Bug

There are several cases that I've identified (there may be more) where the torchx run cli has some poorly aligned multi-line log messages. See the screenshots below:

[screenshots of misaligned multi-line log output omitted]

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

  1. torchx run -s local_cwd --dryrun utils.echo
  2. torchx run -s local_cwd utils.echo

Expected behavior

multi-line log messages should NOT be misaligned

Environment

N/A

Additional context

[torchx/configs] Make runopts, Runopt, RunConfig, scheduler_args more consistent

Description

Consolidate redundant names, classes, and arguments that represent scheduler RunConfig.

Motivation/Background

Currently there are different names for what essentially ends up being the additional runtime options for the torchx.scheduler (see dryrun(..., cfg: RunConfig)).

This runconfig:

  1. has the class type torchx.specs.api.RunConfig (a dataclass)
  2. goes by the function argument name cfg or runcfg in most places in the scheduler and runner source code
  3. is passed from the torchx run cli as --scheduler_args

Additionally each scheduler has what is called a runopts, which are the runconfig options that the scheduler advertises and takes (see runopts for local_scheduler).

The difference between RunConfig and runopts is that the RunConfig object is simply a holder for the user-provided config key-value pairs, while runopts is the schema (type, default, is_required, help string) of the configs that the scheduler takes. Think of runopts as the argparse.ArgumentParser of the Scheduler if it were a cli tool, and RunConfig as the sys.argv[1:] (but as a map instead of an array).
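
A standalone sketch of this split; the class and method names below mirror the description and are illustrative, not the exact torchx API:

from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

@dataclass
class runopts:
    # the schema: name -> (type, default, is_required, help string)
    opts: Dict[str, Tuple[type, Any, bool, str]] = field(default_factory=dict)

    def add(self, name: str, type_: type, default: Any = None,
            required: bool = False, help: str = "") -> None:
        self.opts[name] = (type_, default, required, help)

@dataclass
class RunConfig:
    # the values: the sys.argv[1:] analog, but a map instead of an array
    cfgs: Dict[str, Any] = field(default_factory=dict)

opts = runopts()
opts.add("log_dir", type_=str, help="dir to write stdout/stderr logs to")
cfg = RunConfig(cfgs={"log_dir": "/tmp"})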

Detailed Proposal

The proposal is to clean up the nomenclature as follows:

  1. Deprecate --scheduler_args option in torchx cli and instead call it --cfg (consistent with the parameter names in the Scheduler API).
  2. Change the section name in the runner INI config files from [$profile.scheduler_args.$sched_name] to [$profile.$scheduler_name.cfg] (e.g. [default.scheduler_args.local_cwd] would become [default.local_cwd.cfg])
  3. Rename Runopt to runopt (to be consistent with runopts, which is a holder for runopt by name)

Alternatives

(not really an alternative but other deeper cleanups considered)

  1. changing the cfg parameter name in the Scheduler and Runner interfaces to runconfig (consistent with RunConfig), or alternatively changing RunConfig to RunCfg. This is going to be a huge codemod, hence I've decided to live with it and change the rest of the settings to match cfg.
  2. RunConfig is simply a wrapper around a regular python Dict[str, ConfigValue] (ConfigValue is a type alias, not an actual class) and does not provide any additional functionality on top of the dict other than a pretty-print __repr__(). Considering just dropping the RunConfig dataclass and using Dict[str, ConfigValue] directly (this also requires a huge codemod).

Additional context/links

See hyperlinks above.

workspaces (patch/canary) -- overlay local changes on top of image and deploy to remote

Description

The current workflow for developing a component with Docker based schedulers requires manually building the Docker image every time there's a change. This is an extra layer of friction and requires significant Docker knowledge and maintenance. It would be nice to provide a way to do this via the torchx CLI to reduce user friction.

Detailed Proposal

This is going to add the concept of "workspaces". These workspaces look like file systems and can be implemented as such via fsspec. A workspace can either map to an on-disk project with a .torchxconfig, or to memory for use with notebooks.

This requires adding workspace support to the runner and schedulers. There'll be a couple of standard patching implementations: a stub one for local, one for docker, etc.

Programmatic Experience

For programmatic access we need to implement a high-level concept of a workspace. For CLI commands this will map to the project folder (i.e. the one with the .torchxconfig). For programmatic use, it's a bit more abstract in order to support the notebook workspaces (#344).

Workspaces basically just track files. This means that we can potentially use any fsspec filesystem interface to find and build files. When using Docker, we need to build a tarball of the local files to upload as the context. fsspec provides a clean interface for finding all the files and tarballing them.

from torchx import specs
from torchx.runner import get_runner

app: specs.AppDef = ...

runner = get_runner()
runner.run(app, "kubernetes", workspace="file:///home/d4l3k/my_project")

For things like notebooks we can use an in memory file system:

runner.run(app, "kubernetes", workspace="memory://torchx-notebook/")

CLI Experience

Before:

$ docker build -t repo.sh/my_image:my_tag .
$ docker push repo.sh/my_image:my_tag
$ torchx run -s kubernetes dist.ddp --image repo.sh/my_image:my_tag my_trainer.py

After:

# in folder w/ .torchxconfig
$ torchx run -s kubernetes -cfg push=repo.sh/my_image dist.ddp my_trainer.py

This canary syntax with Docker can work with local_docker, kubernetes and potentially Ray.

Docker

Docker supports layering so we just have to create a small Dockerfile such as

# the specified image
FROM ghcr.io/pytorch/torchx:0.1.1dev0
COPY . .

We'll walk the workspace and upload all files as the Docker context when building.
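
For illustration, the build-and-push step could look roughly like this with the Docker Python SDK; the paths and tags are placeholders, not the actual implementation:

import docker

client = docker.from_env()

# use the materialized workspace directory as the Docker build context
image, _build_logs = client.images.build(
    path="/home/d4l3k/my_project",
    tag="repo.sh/my_image:patched",
)
client.images.push("repo.sh/my_image", tag="patched")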

For local running we just have to build it and use the local tag. For remote running we need a repository to push it to. We can default to pushing to the same one the package specifies and use the image hash as the label. This will be an extra run config that's required to override if they're building off of a standard Docker image such as the provided torchx one, and it can be specified in the .torchxconfig file.

Alternatives

There's some question about whether we should only support Docker or use buildah instead, since that seems to be the more robust option (https://github.com/containers/buildah). However, for now it seems like users are most familiar with Docker, and buildah provides a compatible API, so for maximum support we can use the existing Docker API.

For small components we can inline the file via the existing python component. This will work for many things but not everything.

This also doesn't quite address how we can do the same thing on Slurm, though we could potentially support Docker on Slurm, which would be interesting. Slurm does support OCI images, so we should potentially migrate towards that as first-class support in TorchX.

Additional context/links

Docker Python SDK: https://docker-py.readthedocs.io/en/stable/images.html

[torchx.components] Add support for distributed AppDefs

Description

Add support for running AppDefs that specify num_replicas > 1 and/or that have multiple Roles.

Motivation/Background

Currently distributed app defs are only supported on the local scheduler (each replica/role modeled as a UNIX process).

Detailed Proposal

TODO add detailed proposal

Alternatives

N/A

Additional context/links

N/A

[torchx.tracker] (new module) Implement torchx.tracker

Description

The TorchX tracker is a module that allows:

  1. Smoother input/output passing between AppDefs (without relying on pipeline/scheduler-specific features)
  2. (via the above) authoring app binaries that are infra/platform-agnostic.
  3. Supporting an output of app_1 as the "launch parameters" (e.g. num replicas, resource requirements, etc.) of the next app_2.

Motivation/Background

TODO update.

Detailed Proposal

TODO update with RFC

Alternatives

TODO update with RFC

Additional context/links

N/A

components: copy component

Description

Adds a basic copy io component that uses fsspec to allow ingressing data or copying from one location to another.

Motivation/Background

We previously had a simple copy component using the old Python-style component classes. We deleted it since it didn't use the new-style component definitions.

Having a copy component is generally useful and we should use it for data ingress in the KFP advanced example.

Detailed Proposal

io.py

# note: `from` is a reserved keyword in python, so the args are named src/dst
def copy(src: str, dst: str, image: str = "") -> specs.AppDef: ...
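
The app that this component launches could be a tiny fsspec-based copy script; here's a sketch of that script (not the actual torchx.apps implementation):

import sys

import fsspec

def copy_file(src: str, dst: str) -> None:
    # stream the source object to the destination via fsspec's uniform API
    with fsspec.open(src, "rb") as r, fsspec.open(dst, "wb") as w:
        w.write(r.read())

if __name__ == "__main__":
    copy_file(sys.argv[1], sys.argv[2])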

Alternatives

Additional context/links

[torchx/devX] Create a linter command for non-fb developers

Description

Create a linter command that lints my current clone of torchx.

Motivation/Background

Currently our scripts/lint.sh is designed to run as a github action. Hence it will:

  1. clone torchx github repo
  2. run the linters: usort, black, flake8, copyright, pyre

This isn't too useful if you are a developer who already has torchx checked out and would like to fix lint errors locally before submitting a PR.

Detailed Proposal

Either:

  1. Make the existing scripts/lint.sh take an argument (e.g. --dir ~/workspace/torchx) to lint all the files in my current torchx workspace
  2. -- or -- create a new lint script aimed at developers who want to lint their code before submitting a PR.

Alternatives

Do nothing and hurt the developer experience, since the dev has to submit the PR -> wait for the actions to run -> fix any lint errors -> rinse and repeat.

Additional context/links

N/A

[torchx/cli] Repurpose session name as profiles in the cli and on the runner

Description

Currently we only support a "default" profile for torchx scheduler_args configs.
It would be nice for users to be able to add more profiles so that they can store different default
scheduler_args, as:

[default.local_cwd.cfg]
log_dir=/tmp

[localdevserver.local_cwd.cfg]
log_dir=/home/kiuk/log

[prod.local_cwd.cfg]
log_dir=/var/logs

Motivation/Background

Enables things like

$ torchx run -p localdevserver --scheduler local_cwd ~/component.py:train

Detailed Proposal

We can actually repurpose session_name, which is hooked everywhere from the runner all the way down to the scheduler, and rename it to profile. See: https://github.com/pytorch/torchx/blob/main/torchx/runner/api.py#L557

Profiles should NOT be hierarchical - e.g. if the config file does not contain a section matching the profile, then an error should be raised rather than loading the "default" profile. This is because the config files are already loaded in a hierarchy ($HOME/.torchxconfig, $CWD/.torchxconfig), which would make it confusing if we added profile hierarchies.
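
A minimal sketch of the non-hierarchical resolution, assuming stdlib configparser and the proposed [{profile}.{scheduler}.cfg] section naming; the helper name is illustrative:

import configparser
from typing import Dict

def load_profile_cfg(path: str, profile: str, scheduler: str) -> Dict[str, str]:
    config = configparser.ConfigParser()
    config.read(path)
    section = f"{profile}.{scheduler}.cfg"
    # no fallback: a missing profile section is an error, not "default"
    if not config.has_section(section):
        raise ValueError(f"no section [{section}] in {path}")
    return dict(config.items(section))

# e.g. load_profile_cfg(".torchxconfig", "localdevserver", "local_cwd")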

Alternatives

N/A

Additional context/links

See: torchx/runner/config.py and torchx/runner/api.py:Runner.dryrun

run --scheduler local utils.echo --mesg fails

๐Ÿ› Bug

Module (check all that apply):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • [x] torchx.cli
  • [x] torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

To Reproduce

Steps to reproduce the behavior:

  1. install torchx
  2. execute Quickstart utils.echo --msg example
(venv) smb % pip install torchx
Collecting torchx
  Using cached torchx-0.1.0b0-py3-none-any.whl (90 kB)
Requirement already satisfied: docstring-parser==0.8.1 in ./venv/lib/python3.9/site-packages (from torchx) (0.8.1)
Requirement already satisfied: pyre-extensions in ./venv/lib/python3.9/site-packages (from torchx) (0.0.22)
Requirement already satisfied: typing-inspect in ./venv/lib/python3.9/site-packages (from pyre-extensions->torchx) (0.6.0)
Requirement already satisfied: typing-extensions in ./venv/lib/python3.9/site-packages (from pyre-extensions->torchx) (3.10.0.0)
Requirement already satisfied: mypy-extensions>=0.3.0 in ./venv/lib/python3.9/site-packages (from typing-inspect->pyre-extensions->torchx) (0.4.3)
Installing collected packages: torchx
Successfully installed torchx-0.1.0b0
(venv) smb % torchx run --scheduler local utils.echo --help              
usage: torchx run ...torchx_params... echo  [-h] [--msg MSG]

App spec: Echos a message to stdout (calls /bin/echo)

optional arguments:
  -h, --help  show this help message and exit
  --msg MSG   message to echo
(venv) smb % torchx run --scheduler local utils.echo --msg "Hellow world"
{"session": "", "scheduler": "local", "api": "schedule", "app_id": null, "runcfg": "{\"image_type\": \"dir\", \"log_dir\": null}", "raw_exception": "Traceback (most recent call last):\n  File \"/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/runner/api.py\", line 202, in schedule\n    app_id = sched.schedule(dryrun_info)\n  File \"/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/schedulers/local_scheduler.py\", line 543, in schedule\n    replica = self._popen(role_name, replica_id, replica_params)\n  File \"/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/schedulers/local_scheduler.py\", line 485, in _popen\n    proc = subprocess.Popen(\n  File \"/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py\", line 951, in __init__\n    self._execute_child(args, executable, preexec_fn, close_fds,\n  File \"/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py\", line 1822, in _execute_child\n    raise child_exception_type(err_msg)\nsubprocess.SubprocessError: Exception occurred in preexec_fn.\n", "source": "<unknown>"}
Traceback (most recent call last):
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/bin/torchx", line 8, in <module>
    sys.exit(main())
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/cli/main.py", line 57, in main
    args.func(args)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/cli/cmd_run.py", line 93, in run
    result = runner.run_component(
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/runner/api.py", line 142, in run_component
    return self.run(app, scheduler, cfg)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/runner/api.py", line 164, in run
    return self.schedule(dryrun_info)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/runner/api.py", line 202, in schedule
    app_id = sched.schedule(dryrun_info)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/schedulers/local_scheduler.py", line 543, in schedule
    replica = self._popen(role_name, replica_id, replica_params)
  File "/Users/smb/Work/GIT/PrivateFederatedLearning/venv/lib/python3.9/site-packages/torchx/schedulers/local_scheduler.py", line 485, in _popen
    proc = subprocess.Popen(
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/Cellar/[email protected]/3.9.6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 1822, in _execute_child
    raise child_exception_type(err_msg)
subprocess.SubprocessError: Exception occurred in preexec_fn.

Expected behavior

I expected to see the command output the string Hellow world

Environment

  • torchx version (e.g. 0.1.0rc1): torchx-0.1.0b0
  • Python version: 3.9.6
  • OS (e.g., Linux): MacOS 10.15.7
  • How you installed torchx (conda, pip, source, docker): pip, in a virtual environment
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc): on-prem
  • Any other relevant information:

Additional context

I tried installing the latest torchx from source, but I get a different error. I'll submit another issue about that.

[torchx/cli] Implement a torchx "template" subcommand that copies the given builtin

The torchx cli maintains a list of builtin components that are available via the torchx builtins cmd. The builtin components are patterns configured to execute one use case or another. Users can use these components without the need to manage their own, e.g.

torchx run -s local_cwd dist.ddp --script main.py

would run the user's main.py script in a distributed manner.

For production use cases it is better for users to own their components.
The proposed torchx copy command enables users to create an initial templatized version of their components from the existing builtins. Users can then modify the code however they want.

Example of usage

# torchx/components/dist.py

def ddp(..., nnodes=1):
   return AppDef(..., roles=[Role(name="worker", num_replicas=nnodes)])

torchx copy dist.ddp  

# Output:


def ddp(..., nnodes=1):
   return AppDef(..., roles=[Role(name="worker", num_replicas=nnodes)])

torchx copy will print the corresponding component to stdout, so users can inspect the source code and save it via:

torchx copy dist.ddp > my_component.py

[torchx/specs] Add a metadata map to specs.Role

Description

Add a metadata (Dict[str, str]) parameter to specs.Role

Motivation/Background

Currently, any property of the AppDef that is scheduler specific (and hence not exposed in the parameters of AppDef or Role) needs to either be exposed as a scheduler runopt, or users have to use the dryrun() API to manipulate the scheduler request prior to submission.

This is suboptimal as:

  1. there are properties intrinsic to the AppDef/Role (e.g. an oncall handle) that are being exposed as scheduler runopts as a workaround
  2. there is no good way to specify different values for different roles via runopts; runopts are typically applied to the whole job, not to a specific Role
  3. dryrun + manual field injection is not a great UX and exposes too much scheduler detail, tightly coupling the launch process to the scheduler

Detailed Proposal

Add a metadata: Dict[str, str] field to Role. Each scheduler can read role-specific (sub-gang) properties from this metadata. Hence an application that has multiple Roles and wants to set a different oncall handle per role can do:

AppDef(
  name="multi-role",
  roles = [
      Role(
          name="trainer",
          metadata={"oncall": "[email protected]"},
          # ...
      ), 
      Role(
          name="reader",
          metadat={"oncall": "[email protected]"},
          # ....
      ),
   ],
)
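
On the consuming side, each scheduler would read only the keys it understands. A minimal sketch, assuming the proposed metadata field exists (the "oncall" key and the helper name are illustrative, not part of this proposal):

from torchx.specs import Role

def oncall_for_role(role: Role, default: str = "oncall-unassigned") -> str:
    # role.metadata is the proposed Dict[str, str] field;
    # keys a scheduler does not understand are simply ignored
    return role.metadata.get("oncall", default)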

Alternatives

The two known workarounds (scheduler runopts and dryrun + manual field injection) were developed as alternatives but have proven suboptimal for the reasons explained above.

Additional context/links

N/A

split local scheduler into different ones per image type

Currently the local scheduler has two supported paths depending on the image type: docker and process. We should split these into two separate schedulers so that each scheduler is tied to a single image type, reducing complexity and making things work out of the box (example invocations follow the list below).

Proposed schedulers:

  • local_docker - runs each replica in a local Docker container
  • local_cwd - overrides image with the current working directory for local testing purposes
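
For example, the same builtin could then be launched against either scheduler (hypothetical invocations once the split lands; utils.echo is an existing builtin component):

torchx run -s local_docker utils.echo --msg "hello world"
torchx run -s local_cwd utils.echo --msg "hello world"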

In combination with #184 this should make all builtin components work as expected out of the box with remote docker and local docker/path based schedulers. For Slurm you'll still have to manually override the image for now.

specs.api.from_function handles var args incorrectly when combined with defaults

๐Ÿ› Bug

Args are incorrectly mapped when combining defaulted arguments with varargs.

Module (check all that applies):

  • [x] torchx.spec

To Reproduce

Steps to reproduce the behavior:

from torchx import specs

def bash(
    *args: str, image: str = "/tmp", num_replicas: int = 1
) -> specs.AppDef:
    """
    Runs the provided command via bash.

    Args:
        args: bash arguments
        image: image to use
        num_replicas: number of replicas to run
    """
    print(locals())

$ torchx run --scheduler kubernetes --wait --scheduler_args queue=test utils.bash --image alpine:latest --num_replicas 3 env
{'image': '/tmp', 'num_replicas': 1, 'args': ('alpine:latest', 3, 'env')}

Expected behavior

Should print

{'image': 'alpine:latest', 'num_replicas': 3, 'args': ('env',)}
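
For reference, Python itself binds these correctly when the function is called directly, since parameters declared after *args are keyword-only; the component CLI parser should produce the same mapping:

def bash(*args: str, image: str = "/tmp", num_replicas: int = 1) -> None:
    print(locals())

# keyword arguments bind to the keyword-only parameters,
# and only the positional remainder lands in *args
bash("env", image="alpine:latest", num_replicas=3)
# prints: {'image': 'alpine:latest', 'num_replicas': 3, 'args': ('env',)}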

Environment

  • torchx version (e.g. 0.1.0rc1): master
  • Python version: 3.9
  • OS (e.g., Linux): linux
  • How you installed torchx (conda, pip, source, docker): setup.py
  • Docker image and tag (if using docker):
  • Git commit (if installed from source): 09e1ea1
  • Execution environment (on-prem, AWS, GCP, Azure etc): cli
  • Any other relevant information:

Additional context

[torchx/runner] Split run_component into run_component and dryrun_component

Description

Make runner.run_component(..., dryrun: bool) follow the run() vs dryrun() convention instead of returning a Union[AppHandle, AppDryrunInfo] depending on whether dryrun is True or False.

See: https://github.com/pytorch/torchx/blob/main/torchx/runner/api.py#L101

Motivation/Background

A couple of reasons why we should do this:

  1. Plays nicely with the pyre typechecker. Since the return type of run_component() depends on the dryrun parameter, pyre will always raise a typecheck error unless we manually check the return type, which is redundant for the programmer: the human knows that dryrun=False always yields an AppHandle and dryrun=True always yields an AppDryrunInfo, but the machine doesn't.

  2. Be consistent with run() and dryrun() APIs of the runner.

  3. Good practice. Functions should typically do ONE thing. It's bad enough that run_component() does two different things based on the dryrun flag. To make matters worse, the RETURN type of run_component() is different too.

Detailed Proposal

  1. Create a dryrun_component() -> AppDryrunInfo that DRYRUNs a component
  2. Make run_component() -> AppHandle RUN a component
  3. We can use function composition to implement run_component() as:
def run_component(...):
    dryrun_info = dryrun_component(...)
    return schedule(dryrun_info)
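
Caller-side code then becomes unambiguous for both the reader and the typechecker. A sketch (the exact argument names below are illustrative, not the final signature):

from torchx.runner import get_runner

runner = get_runner()

# explicit dryrun: inspect or mutate the request before submitting
dryrun_info = runner.dryrun_component("dist.ddp", ["--script", "main.py"], scheduler="local")
app_handle = runner.schedule(dryrun_info)

# one-shot run
app_handle = runner.run_component("dist.ddp", ["--script", "main.py"], scheduler="local")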

Alternatives

Alternatively, we can delegate run_component() and dryrun_component() to their run(AppDef) and dryrun(AppDef) counterparts as follows:

def run_component(...):
    component = get_component()
    appdef = component.fn(...) # eval to get AppDef
    return run(appdef)

def dryrun_component(...):
    component = get_component()
    appdef = component.fn(...) # eval to get AppDef
    return dryrun(appdef)

Additional context/links

N/A

local_scheduler: there's no way to fetch the stdout logs

๐Ÿ› Bug

There's currently no way to fetch the stdout logs via the programmatic interface. This is problematic when running from Bento (a notebook environment), since you can only view stderr while many simple training scripts use print(...).
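
For illustration, a sketch of the current programmatic path and where it falls short (argument details are approximate; the point is that log_lines has no way to select stdout):

from torchx import specs
from torchx.runner import get_runner

app_def = specs.AppDef(
    name="echo",
    roles=[specs.Role(name="worker", image="/tmp", entrypoint="echo", args=["hello"])],
)
runner = get_runner()
app_handle = runner.run(app_def, scheduler="local")
# log_lines currently surfaces stderr only; output written via print(...) is lost
for line in runner.log_lines(app_handle, role_name="worker", k=0):
    print(line, end="")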

Module (check all that applies):

  • torchx.spec
  • torchx.component
  • torchx.apps
  • torchx.runtime
  • torchx.cli
  • torchx.schedulers
  • torchx.pipelines
  • torchx.aws
  • torchx.examples
  • other

[torchx/local_scheduler] Remove dependency on pr_set_deathsig for handling orphans/zombies

Description

Currently local_scheduler (https://github.com/pytorch/torchx/blob/master/torchx/schedulers/local_scheduler.py#L317-L319) depends on the Linux pr_set_deathsig prctl to ensure that when the scheduler is terminated, so are the processes it launched. Unfortunately pr_set_deathsig is only available on Linux-based platforms and not on macOS.

Motivation/Background

Support running torchx examples and basic functionality on macOS.

Detailed Proposal

The same issue exists in pytorch (torch.distributed.elastic.multiprocessing). See this PR (pytorch/pytorch#61602). The solution for local_scheduler is the same.
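
A minimal sketch of a portable approach, assuming we track launched processes and reap them via atexit and signal handlers instead of the Linux-only prctl (all names here are illustrative, not the actual fix in that PR):

import atexit
import signal
import subprocess
import sys
from typing import List

_children: List[subprocess.Popen] = []

def _terminate_children() -> None:
    # best-effort cleanup of every process this scheduler launched
    for proc in _children:
        if proc.poll() is None:  # still running
            proc.terminate()

def _handle_signal(signum: int, frame: object) -> None:
    _terminate_children()
    sys.exit(128 + signum)

# works on both Linux and macOS, unlike pr_set_deathsig
atexit.register(_terminate_children)
signal.signal(signal.SIGTERM, _handle_signal)
signal.signal(signal.SIGINT, _handle_signal)

def popen_tracked(args: List[str]) -> subprocess.Popen:
    # drop-in replacement for subprocess.Popen that registers the child
    proc = subprocess.Popen(args)
    _children.append(proc)
    return proc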

Alternatives

N/A - the alternative is to drop support for macOS altogether and have macOS users use the torchx docker image to test things out. While we recommend Linux-based systems for running torchx in production, someone new to torchx will plausibly try it out on their laptop, hence we'd like the "quickstart" path to work on macOS.

Additional context/links

N/A

[torchx][components] Enhance integ components test with automatic testing

Description

This feature changes the way we perform component integ tests: instead of manually adding a ComponentsProvider for each builtin, we find all components available in the code and test them automatically.

Detailed Proposal

Today torchx implements component testing as integ tests. For each builtin component, one needs to create a class implementing the ComponentsProvider interface; without the corresponding class, the builtin component will not be tested.

The proposal is to change this behavior. A component can be one of two types: 1. a function whose parameters all have defaults, 2. a function with both defaulted and required parameters.

Components of type #1 can be automatically instantiated and tested on different schedulers. Components of type #2 are trickier, since a component can be complicated: e.g. a copy component needs an existing file for testing, and when the job is done it leaves a new file on the system. Components of type #2 can be wrapped with a special decorator:

# the provider must be defined before the decorator references it
class CopyUtilProvider(ComponentProvider):
    def setUp(self) -> None:
        # invoked before the component is instantiated,
        # e.g. to create self._src_path, self._dst_path, self._image
        ...

    def tearDown(self) -> None:
        # invoked after the test finishes, e.g. to remove the copied file
        ...

    def get_app_def(self) -> specs.AppDef:
        return copy(
            src=self._src_path, dst=self._dst_path, image=self._image
        )

@torchx.test(provider=CopyUtilProvider)
def copy(src: str, dst: str, image: str) -> specs.AppDef:
    return specs.AppDef(...)

In the code above, we associate the component with its component provider. ComponentProvider is an interface with the following methods: setUp, tearDown, and get_app_def.
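
A rough sketch of the discovery half, using only the standard library (the module path and helper names are illustrative, not an existing torchx API):

import importlib
import inspect
from typing import Callable, List

def find_components(module_name: str) -> List[Callable]:
    # collect the public functions of a components module,
    # e.g. "torchx.components.utils"
    module = importlib.import_module(module_name)
    return [
        fn
        for name, fn in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_") and fn.__module__ == module.__name__
    ]

def is_type1(fn: Callable) -> bool:
    # type #1 components can be instantiated with no arguments at all
    return all(
        p.default is not inspect.Parameter.empty
        or p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
        for p in inspect.signature(fn).parameters.values()
    )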

[torchx/version] Make torchx.version.TORCHX_IMAGE follow the same semantics as `__version__`

Description

torchx.__version__ and torchx.__image__ should be consistent. Currently the version is obtained via torchx.__version__, but the image via torchx.version.TORCHX_IMAGE.

Detailed Proposal

Import torchx.version.TORCHX_IMAGE in the root __init__.py file so that TORCHX_IMAGE can be referenced as torchx.__image__. (In addition we should make torchx.__version__ and torchx.__image__ reflect the FB version and image when used internally, rather than from GitHub or pip.)
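
A minimal sketch of the proposed addition, assuming torchx/version.py already defines TORCHX_IMAGE:

# torchx/__init__.py (proposed addition)
from torchx.version import TORCHX_IMAGE as __image__

Callers could then reference torchx.__image__ the same way they reference torchx.__version__.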

Motivation/Background

Once this is done we can change all references to TORCHX_IMAGE in the builtin components, which would make them work in OSS and also at FB (since presumably torchx.__image__ would point to the ghcr.io docker image for OSS and to our internal package at FB).

Alternatives

N/A

Additional context/links

N/A

Documentation feedback

📚 Documentation

At a high level, the repo really needs a glossary of terms on a single page; otherwise it's easy to forget what they mean when you get to a new page. Lots of content can be deleted; specifically, the example application notebooks don't add anything relative to the pipeline examples.

General list of feedback - not sure how to fix it yet, so I'm opening an issue instead of a PR.

https://pytorch.org/torchx/latest/quickstart.html

  • Add a link to where the builtins are defined in the code

https://pytorch.org/torchx/latest/cli.html

  • "App bundle" is never defined - is it the app ID (echo_c944ffb2 in the docs)?

https://pytorch.org/torchx/latest/configure.html

  • One thing that wasn't too clear: is a resource basically the number of CPUs and GPUs? It would be helpful to add a helper enum that includes something higher level, like a V100 machine, for provisioning

https://pytorch.org/torchx/latest/examples_apps/datapreproc/component.html#sphx-glr-examples-apps-datapreproc-component-py

  • Example has no main function

https://pytorch.org/torchx/latest/examples_apps/datapreproc/datapreproc.html#sphx-glr-examples-apps-datapreproc-datapreproc-py

I'm not sure what the Application Examples notebooks add. Maybe best to refactor them together with the pipeline examples? Let me know and I can send a PR to delete them.

https://pytorch.org/torchx/latest/examples_pipelines/kfp/intro_pipeline.html#sphx-glr-examples-pipelines-kfp-intro-pipeline-py

  • Make it clearer that pipeline.yaml is generated and not an input - I kept looking for it in the source directory

https://pytorch.org/torchx/latest/components/base.html

Why is torch.elastic mentioned here? The whole repo feels like it needs a glossary. For example, by "image" do you mean a docker image?

https://pytorch.org/torchx/latest/components/hpo.html

  • Just says TBD - delete it for now?

https://pytorch.org/torchx/latest/components/utils.html

  • Add a link to the supported utils in the codebase

https://pytorch.org/torchx/latest/schedulers/kubernetes.html

  • This says "coming soon" but it looks like the feature is available?

https://pytorch.org/torchx/latest/beta.html

Also empty; just remove it.
