
Amazon SageMaker Distribution

Amazon SageMaker Distribution is a set of Docker images that include popular frameworks for machine learning, data science and visualization. For the list of supported SageMaker Distribution images, see SageMaker Distribution Images.

These images come in two variants, CPU and GPU, and include deep learning frameworks like PyTorch, TensorFlow and Keras; popular Python packages like numpy, scikit-learn and pandas; and IDEs like Jupyter Lab. The distribution contains the latest versions of all these packages such that they are mutually compatible.

This project follows semver (more on that below) and comes with a helper tool to automate new releases of the distribution.

Getting started

If you just want to use the images, you do not need to use this GitHub repository. Instead, you can pull pre-built and ready-to-use images from our AWS ECR Gallery repository.

Dependency versions included in a particular Amazon SageMaker Distribution version

If you want to check what packages are installed in a given version of Amazon SageMaker Distribution, you can find that in the relevant RELEASE.md file in the build_artifacts directory.

Versioning strategy

Amazon SageMaker Distribution supports semantic versioning as described on semver.org. A major version upgrade of Amazon SageMaker Distribution allows major version upgrades of all its dependencies, and similarly for minor and patch version upgrades. However, it is important to note that Amazon SageMaker Distribution’s ability to follow semver guidelines is currently dependent on how its dependencies adhere to them.

Some dependencies, such as Python, will be treated differently. Amazon SageMaker Distribution will allow a minor upgrade of Python (say, 3.10 to 3.11) only during a major upgrade (say, 4.8 to 5.0).
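The bump rules above can be sketched with a small helper (a minimal illustration only, not part of the project's tooling):

```python
def bump(version: str, level: str) -> str:
    """Bump a semver-style version string at the given level."""
    major, minor, patch = (int(p) for p in version.split("."))
    if level == "major":   # allows major upgrades of dependencies (and Python minors)
        return f"{major + 1}.0.0"
    if level == "minor":   # allows minor upgrades of dependencies
        return f"{major}.{minor + 1}.0"
    if level == "patch":   # allows patch upgrades of dependencies only
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level}")

# A Python minor upgrade (say, 3.10 to 3.11) would only ship alongside
# a major distribution upgrade, e.g. 4.8.2 -> 5.0.0:
print(bump("4.8.2", "major"))  # 5.0.0
```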

Image tags

Our current image tagging scheme is: <AMAZON_SAGEMAKER_DISTRIBUTION_VERSION_NUMBER>-<CPU_OR_GPU>. For example, the CPU version of Amazon SageMaker Distribution's v0.1.2 will carry the following tags:

  1. 0.1.2-cpu: once an image is tagged with such a patch version, that tag will not be assigned to any other image in the future.
  2. 0.1-cpu: this tag, and the two below, can move to a new image when new versions of Amazon SageMaker Distribution are released.
  3. 0-cpu
  4. latest-cpu

So, if you want to stay on the latest software as and when it is released by Amazon SageMaker Distribution, you can use the latest-cpu tag and re-run docker pull for that tag when needed. If you instead use, say, 0.1.2-cpu, the underlying distribution will remain the same over time.

Package Staleness Report

If you want to generate/view the staleness report for each of the individual packages in a given SageMaker distribution image version, then run the following command:

VERSION=<Insert SageMaker Distribution version in semver format here. example: 0.4.2>
python ./src/main.py generate-staleness-report --target-patch-version $VERSION

Package Size Delta Report

If you want to generate/view the package size delta report for a given SageMaker distribution image version compared to a base image version, then run the following command:

BASE_PATCH_VERSION=<Insert SageMaker Distribution version of the base image in semver format here. example: 1.6.1>
VERSION=<Insert SageMaker Distribution version of the target image in semver format here. example: 1.6.2>
python ./src/main.py generate-size-report --base-patch-version $BASE_PATCH_VERSION --target-patch-version $VERSION

Example use cases

Here are some examples on how you can try out one of our images.

Local environment, such as your laptop

The easiest way to get it running on your laptop is through the Docker CLI:

export ECR_IMAGE_ID='INSERT_IMAGE_YOU_WANT_TO_USE'
docker run -it \
    -p 8888:8888 \
    -v `pwd`/sample-notebooks:/home/sagemaker-user/sample-notebooks \
    $ECR_IMAGE_ID jupyter-lab --no-browser --ip=0.0.0.0

(If you have access to Nvidia GPUs, you can pass --gpus=all to the Docker command.)

The image also has built-in entrypoints that automatically start the IDE server and restart it after minor interruptions or crashes. For example, to start the JupyterLab server using the built-in entrypoint:

export ECR_IMAGE_ID='INSERT_IMAGE_YOU_WANT_TO_USE'
docker run -it \
    -p 8888:8888 \
    --entrypoint entrypoint-jupyter-server \
    -v `pwd`/sample-notebooks:/home/sagemaker-user/sample-notebooks \
    $ECR_IMAGE_ID

In the console output, you'll then see a URL similar to http://127.0.0.1:8888/lab?token=foo. Just open that URL in your browser, create a Jupyter Lab notebook or open a terminal, and start hacking.

Note that the sample command above bind-mounts a directory under your current working directory into the container. That way, if you re-create the container (say, to use a different version or the CPU/GPU variant), any files you created within that directory (such as Jupyter Lab notebooks) will persist.

Amazon SageMaker Studio

Amazon SageMaker Studio is a web-based, integrated development environment (IDE) for machine learning that lets you build, train, debug, deploy, and monitor your machine learning models.

To use the sagemaker-distribution image in SageMaker Studio, select SageMaker Distribution v{Major_version} {CPU/GPU} using the SageMaker Studio Launcher.

"I want to directly use the Conda environment, not via a Docker image"

Amazon SageMaker Distribution supports full reproducibility of Conda environments, so you don't necessarily need to use Docker. Just find the version number you want to use in the build_artifacts directory, open cpu.env.out or gpu.env.out, and follow the instructions in its first two lines.

Customizing image

If you'd like to create a new Docker image on top of what we offer, we recommend you use micromamba install ... instead of pip install ....

For example:

FROM public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
USER $ROOT
RUN apt-get update && apt-get install -y vim
USER $MAMBA_USER
RUN micromamba install sagemaker-inference --freeze-installed --yes --channel conda-forge --name base

FIPS

As of sagemaker-distribution v0.12+, v1.6+, and v2+, the images come with a FIPS 140-2 validated OpenSSL provider available for use. You can enable the FIPS provider by running:

export OPENSSL_CONF=/opt/conda/ssl/openssl-fips.cnf

For more info on the FIPS provider see: https://github.com/openssl/openssl/blob/master/README-FIPS.md

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.


sagemaker-distribution's Issues

Add sagemaker jupyter extensions

Add the following jupyter extension packages related to sagemaker to the CPU/GPU variants in 1.x.y series:

  • sagemaker-jupyterlab-extension
  • sagemaker-jupyterlab-emr-extension
  • amazon-sagemaker-jupyter-scheduler
  • jupyter-server-proxy

Automatically generate changelogs

We should improve the build command in our tool to automatically generate a changelog. For example, if we create a new release 0.2.0 on top of 0.1.4, the build command should also output a changelog of everything that changed during that release.

This changelog could be a changelog.out file in the build_artifacts directory of the target_version. It could then contain a table where each row contains a package name, older version (i.e. the version in the base_version) and new version (== version in target_version).
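The row generation could be sketched as follows (a hypothetical helper, assuming the package versions of the base and target releases are available as name-to-version mappings; the package names below are illustrative):

```python
def changelog_rows(base: dict[str, str], target: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (package, old_version, new_version) rows for every package
    whose version changed between the base and target releases."""
    return [
        (pkg, base[pkg], target[pkg])
        for pkg in sorted(base)
        if pkg in target and base[pkg] != target[pkg]
    ]

base = {"numpy": "1.24.0", "pandas": "1.5.3", "scipy": "1.10.0"}
target = {"numpy": "1.26.0", "pandas": "1.5.3", "scipy": "1.11.1"}
for pkg, old, new in changelog_rows(base, target):
    print(f"{pkg}: {old} -> {new}")
```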

aiobotocore dependency on old version of botocore causes dependency conflicts

Problem

sagemaker-distribution will likely have multiple packages that (transitively) depend on aiobotocore. aiobotocore's dependency on an old version of botocore (1.31.17) causes dependency conflicts with packages that depend on a newer version of botocore. For example, because of this the latest versions of the s3fs and boto3 packages conflict (new versions of boto3 depend on botocore >=1.31.35,<1.32.0 and up).

Potential solution

Bump botocore to latest version in aiobotocore repository: aio-libs/aiobotocore#1043.

Issue finding closure for 1.5.2

sagemaker-distribution 1.5.1 came with notebook 7.0.7 and jupyterlab 4.1.1, as notebook's dependencies were jupyterlab >=4.0.7,<5. Now, notebook has patched the dependencies for 7.0.7 (link), changing the requirement to jupyterlab >=4.0.7,<4.1.0a0. This means the current 1.5 closure is broken, and to resolve it we would need a manual update for 1.5.2 to either upgrade notebook to 7.1 or downgrade jupyterlab to 4.0.

Options here are:

  • Upgrade notebook to 7.1 in order to keep jupyterlab at 4.1
    This would mean we are upgrading notebook to a newer version, whose closure would match the previously installed jupyterlab
    Package            Previous Version   Current Version
    jupyterlab         4.1.1              4.1.2
    jupyter-scheduler  2.5.0              2.5.1
    scikit-learn       1.4.0              1.4.1.post1
    notebook           7.0.7              7.1.0
    langchain          0.1.6              0.1.9
    matplotlib         3.8.2              3.8.3
  • Downgrade jupyterlab to 4.0
    This would be downgrading jupyterlab to match notebook's new closure. This would be the closure which would have been found, if notebook's dependencies had been correct.
    Package            Previous Version   Current Version
    jupyterlab         4.1.1              4.0.12
    jupyter-scheduler  2.5.0              2.5.1
    scikit-learn       1.4.0              1.4.1.post1
    langchain          0.1.6              0.1.9
    matplotlib         3.8.2              3.8.3

I could understand either approach. Downgrading could introduce breaking change, but would represent what the closure "should" be for this minor version. Upgrading notebook would go against our semver approach, but would prevent any breaking change (assuming notebook is following semver). Wanted to open this issue for discussion on which approach we want to take, as it's likely this will happen at some point in the future and we want to find the best approach.

Fix Scipy tests for GPU based docker images

Issue: Currently the scipy tests are failing when run against the GPU images.

More background:
On PyPI, scipy is distributed as a single package, whereas on conda-forge it is split into two packages: scipy and scipy-tests. (This is a recent change that happened two months ago.)

There is a possibility that one of these packages is out of date, indirectly causing the tests to fail.

Acceptance Criteria: scipy tests should succeed for GPU-based Docker images. Another possible recommendation is to come up with a different testing plan for scipy.

How to maintain semver when consuming dependencies?

We try to follow semantic versioning (https://github.com/aws/sagemaker-distribution?tab=readme-ov-file#versioning-strategy, https://semver.org/). However, there are some problems with this in the way we consume dependencies.

  1. Any package we consume can simply not follow semver, and there would be no way to detect it. This could introduce backward-incompatible changes. We do call this out in our README: "it is important to note that Amazon SageMaker Distribution’s ability to follow semver guidelines is currently dependent on how its dependencies adhere to them"
  2. We consume several (15 as of 1.5.2-cpu) packages with major version 0 e.g. 0.y.z. According to semver docs, these versions do not adhere to any semver specifications, and should not be considered stable (see https://semver.org/#spec-item-4). Currently these are being treated the same as any other major version.

It would be nice to have some strategy/tooling to at least try to enforce semver principles for our releases, so opening this issue to host discussion on this topic.
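One small piece of such tooling could be a check that flags 0.y.z dependencies, which semver treats as unstable (a hedged sketch; the package names and versions below are illustrative):

```python
def unstable_packages(versions: dict[str, str]) -> list[str]:
    """Return packages with major version 0, which carry no stability
    guarantees under semver (see semver.org spec item 4)."""
    return sorted(
        pkg for pkg, ver in versions.items()
        if int(ver.lstrip("v").split(".")[0]) == 0
    )

installed = {"jupyterlab": "4.1.1", "somelib": "0.9.3", "otherlib": "0.2.0"}
print(unstable_packages(installed))  # ['otherlib', 'somelib']
```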

Add gnupg2 to the SageMaker Distribution

For organizations that sign git commits with GPG, it would be helpful if GPG were installed in the SageMaker Distribution by default.

For now, I've got it working by adding an entry to the SMStudio lifecycle configuration script like below, to install at run time:

if ! command -v gpg &> /dev/null
then
    echo "gpg not found: Trying to sudo apt-get install gnupg2"
    sudo apt-get install -y gnupg2
fi

Allow docker-push to public ECR repositories

Today, our script only allows pushing Docker images to private ECR repositories.

ECR's public gallery has a separate Boto3 client documented here. We should integrate that such that we are able to push to public repositories as well.

Fix Major version env.in generation logic

Description:

Currently the package staleness script is broken for any major version release, because we have an extra comma in the env.in files. Conda's MatchSpec is not able to parse the resulting string and throws an exception.

Example:

conda-forge::jinja2[version='>=3.1.2,']

We need to fix the logic so that the trailing comma is not added when there is no upper-bound version constraint.

Correct value should be

conda-forge::jinja2[version='>=3.1.2']

Add seaborn to sagemaker-distribution

Request to add seaborn to the sagemaker-distribution image. This package was included in previous SageMaker images, but now customers need to install it themselves.

Configure git with AWS CodeCommit credential helper

This would allow working with AWS CodeCommit with temporary IAM credentials available in JupyterLab and CodeEditor when using the SageMaker Distribution in SageMaker Studio.

git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true

Still Beta?

Does the Beta notice in the README still apply? The project seems to be public now, so the notice no longer seems relevant.

Fix sm-python-sdk test failure

While testing in preparation for a release, the sm-python-sdk test suite encountered a failure. This failure is the result of a backwards-incompatible change introduced in the newest minor version of sm-python-sdk. Consider reaching out to the owners of the package to have the compatibility issue fixed in the next release.

==================================== ERRORS ====================================
__________________ ERROR collecting tests/unit/test_tuner.py ___________________
/opt/conda/lib/python3.8/site-packages/sagemaker/interactive_apps/base_interactive_app.py:50: in __init__
    self.region = Session().boto_region_name
/opt/conda/lib/python3.8/site-packages/sagemaker/session.py:244: in __init__
    self._initialize(
/opt/conda/lib/python3.8/site-packages/sagemaker/session.py:271: in _initialize
    raise ValueError(
E   ValueError: Must setup local AWS configuration with a region supported by SageMaker.

During handling of the above exception, another exception occurred:
tests/unit/test_tuner.py:49: in <module>
    from .tuner_test_utils import *  # noqa: F403
tests/unit/tuner_test_utils.py:81: in <module>
    ESTIMATOR = Estimator(
/opt/conda/lib/python3.8/site-packages/sagemaker/estimator.py:3022: in __init__
    super(Estimator, self).__init__(
/opt/conda/lib/python3.8/site-packages/sagemaker/estimator.py:755: in __init__
    self.tensorboard_app = TensorBoardApp(region=self.sagemaker_session.boto_region_name)
/opt/conda/lib/python3.8/site-packages/sagemaker/interactive_apps/base_interactive_app.py:52: in __init__
    raise ValueError(
E   ValueError: Failed to get the Region information from the default config. Please either pass your Region manually as an input argument or set up the local AWS configuration.
=========================== short test summary info ============================
ERROR tests/unit/test_tuner.py - ValueError: Failed to get the Region informa...
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!

Add `*.env.out` to `.gitignore`

To my knowledge, *.env.out files can change on rebuild even if no code changes were made.

With this, I propose that our next steps are to:

  1. Verify that *.env.out files can change on rebuild even when no new packages are added to *.env.in, and verify that they're not being used for any purpose other than to track what packages went into a release
  2. Add *.env.out to .gitignore
  3. Add a GitHub rule enforcing that new PRs are rebased onto the latest commit of the target branch, as merge conflicts will no longer stop contributors from merging PR branches atop an old commit.

Jupyter 4.1.0 upgrade

Issue to track release timeline and any emergent blockers to support Jupyter v4.1.0

Python 3.8?

Any reason it's fixed on that old version?

SageMaker Lab already uses 3.9, so switching to the distribution as an environment wasn't straightforward.

pandas test failures

We're seeing test failures for pandas for versions 2.1.1 and newer. Many of the failures were solved by fixing the test environment in the dockerfile -- see #107

However, there remain three failures still under investigation:

FAILED pandas/tests/frame/test_arithmetic.py::TestFrameFlexArithmetic::test_floordiv_axis0_numexpr_path[python-pow] - AssertionError: DataFrame.iloc[:, 0] (column name="0") are different
FAILED pandas/tests/io/excel/test_openpyxl.py::test_engine_kwargs_append_data_only[True-0-.xlsx] - AssertionError: assert None == 0
FAILED pandas/tests/io/excel/test_writers.py::TestExcelWriterEngineTests::test_ExcelWriter_dispatch[OpenpyxlWriter-.xlsx] - IndexError: At least one sheet must be visible

Issue using aws cli as of ~v1.4

Certain aws cli calls are failing with the following message:

sagemaker-user@default:~$ aws sts get-caller-identity

Unable to redirect output to pager. Received the following error when opening pager:
[Errno 2] No such file or directory: 'less'

Learn more about configuring the output pager by running "aws help config-vars".

Please install less to resolve: apt-get install less

jupyter-ai v2.10

Tracks inclusion of jupyter-ai v2.10 into the runtime. Ideally (but not required), it would be included alongside JupyterLab v4.1.0 (tracked in #173).

Staleness report generator fails to parse some version strings

During the release for v1.5.0 the staleness report generator threw an error

python ./src/main.py generate-staleness-report --target-patch-version 1.5.0
Traceback (most recent call last):
  File "/local/home/jubrownd/workplace/sagemaker-distribution/./src/main.py", line 413, in <module>
    parse_args(get_arg_parser())
  File "/local/home/jubrownd/workplace/sagemaker-distribution/./src/main.py", line 409, in parse_args
    args.func(args)
  File "/local/home/jubrownd/workplace/sagemaker-distribution/src/package_staleness.py", line 96, in generate_package_staleness_report
    _get_installed_package_versions_and_conda_versions(image_config, target_version_dir,
  File "/local/home/jubrownd/workplace/sagemaker-distribution/src/package_staleness.py", line 86, in _get_installed_package_versions_and_conda_versions
    _get_package_versions_in_upstream(target_packages_match_spec_out,
  File "/local/home/jubrownd/workplace/sagemaker-distribution/src/package_staleness.py", line 23, in _get_package_versions_in_upstream
    package_version = get_semver(package_version)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/home/jubrownd/workplace/sagemaker-distribution/src/utils.py", line 26, in get_semver
    version = Version.parse(version_str)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jubrownd/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/semver/version.py", line 646, in parse
    raise ValueError(f"{version} is not valid SemVer string")
ValueError: 0.27.0.post1 is not valid SemVer string

It appears a version string in the env.out files resolved by conda is not actually a valid SemVer string (according to the SemVer parser used in our tool). We'll need to investigate to understand the discrepancy.

In this case the full package/version that causes the issue is:

https://conda.anaconda.org/conda-forge/linux-64/uvicorn-0.27.0.post1-py310hff52083_0.conda#ef1c54f288f05c7a2d931c8ff3596a76
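One possible workaround (a sketch only, not necessarily the fix the tool should adopt) is to strip the PEP 440 post-release suffix before handing the string to the SemVer parser:

```python
import re

def normalize_post_release(version_str: str) -> str:
    """Drop a PEP 440 '.postN' suffix, which is not valid SemVer,
    leaving other version strings untouched."""
    return re.sub(r"\.post\d+$", "", version_str)

print(normalize_post_release("0.27.0.post1"))  # 0.27.0
print(normalize_post_release("1.5.0"))         # 1.5.0
```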

SM studio - Docker CLI installation for private VPC accounts without internet

We are among the Studio account owners trying to help our customers use the newly enabled local mode in SM Studio for running training and processing jobs. However, since SM Studio local mode also requires installing docker-ce-cli and its corresponding dependencies, and our Studio environment does not have internet access, installing Docker is not straightforward. We cannot enable internet access due to security reasons, which means it is not possible for us to use SM Studio local mode features inside our private VPC.

Generate a table of package dependencies

As part of the build command, we generate a couple of env.out files which contain detailed information about all the packages in the package closure.

We should also generate a new file which contains the same information but in a more human-consumable way. For example:

  • The new file could consolidate version numbers etc., per package, across CPU and GPU variants. That way, users don't have to tally multiple files to know which version of, say, PyTorch is present in CPU and GPU variants.
  • We can present the data in tabular format.

Using a custom SM image based on sagemaker-distribution hangs and fails in SM studio

I am not quite sure where to report but since the docs outline how to build a custom image I will try here.

I am building this custom image and pushing it to ECR and adding to sagemaker images and creating app image config, like one would according to the docs.

I am defining my docker image like this

FROM --platform=linux/amd64 public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu
USER $ROOT
RUN apt-get clean
# dependencies for building python and having opencv
RUN apt-get update && \
    apt-get install -y gcc g++ python3-dev ffmpeg libsm6 libxext6 && \
    rm -rf /var/lib/apt/lists/* && \
    apt-get clean

USER $MAMBA_USER
# copy the environment.yml file into the container
COPY --chown=$MAMBA_USER:$MAMBA_USER processing/environment.yml /tmp/environment.yml

# Use micromamba to install the dependencies from the environment.yml file
RUN micromamba install -y -n base -f /tmp/environment.yml && \
  micromamba clean --all --yes

The only difference I can see in the logs is these two lines, both at 2024-02-09T10:23:34.006+01:00 (marked below):


2024-02-09T10:23:34.006+01:00	[I 2024-02-09 09:23:33.875 ServerApp] Loading SageMaker Studio EMR server extension 0.1.9

2024-02-09T10:23:34.006+01:00	[I 2024-02-09 09:23:33.876 ServerApp] sagemaker_jupyterlab_emr_extension | extension was successfully loaded.

2024-02-09T10:23:34.006+01:00	[I 2024-02-09 09:23:33.876 ServerApp] Loading SageMaker JupyterLab server extension 0.2.0

2024-02-09T10:23:34.006+01:00	[I 2024-02-09 09:23:33.877 ServerApp] sagemaker_jupyterlab_extension | extension was successfully loaded.

2024-02-09T10:23:34.006+01:00	[I 2024-02-09 09:23:33.877 ServerApp] Loading SageMaker JupyterLab common server extension 0.1.9

2024-02-09T10:23:34.006+01:00	[I 2024-02-09 09:23:33.877 ServerApp] sagemaker_jupyterlab_extension_common | extension was successfully loaded.

2024-02-09T10:23:34.006+01:00	[I 2024-02-09 09:23:33.878 ServerApp] Serving notebooks from local directory: /home/sagemaker-user

this line -------> 2024-02-09T10:23:34.006+01:00	[I 2024-02-09 09:23:33.878 ServerApp] Jupyter Server 2.10.0 is running at: <-------- this line

2024-02-09T10:23:34.006+01:00	[I 2024-02-09 09:23:33.878 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

2024-02-09T10:23:34.006+01:00	[W 2024-02-09 09:23:33.882 ServerApp] No web browser found: Error('could not locate runnable browser').

this line -------> 2024-02-09T10:23:34.006+01:00	[C 2024-02-09 09:23:33.882 ServerApp] To access the server, open this file in a browser: file:///home/sagemaker-user/.local/share/jupyter/runtime/jpserver-1-open.html Or copy and paste one of these URLs: <------ and this line

2024-02-09T10:23:34.006+01:00	INFO: State start

2024-02-09T10:23:34.006+01:00	INFO: Scheduler at: inproc://169.255.254.1/1/1

2024-02-09T10:23:34.006+01:00	INFO: dashboard at: http://169.255.254.1:8787/status

2024-02-09T10:23:34.006+01:00	INFO: Registering Worker plugin shuffle

2024-02-09T10:23:34.006+01:00	INFO: Start worker at: inproc://169.255.254.1/1/4

2024-02-09T10:23:34.006+01:00	INFO: Listening to: inproc169.255.254.1

2024-02-09T10:23:34.006+01:00	INFO: Worker name: 0

2024-02-09T10:23:34.006+01:00	INFO: dashboard at: 169.255.254.1:39899

2024-02-09T10:23:34.006+01:00	INFO: Waiting to connect to: inproc://169.255.254.1/1/1

2024-02-09T10:23:34.006+01:00	INFO: -------------------------------------------------

2024-02-09T10:23:34.006+01:00	INFO: Threads: 2

2024-02-09T10:23:34.006+01:00	INFO: Memory: 3.78 GiB

2024-02-09T10:23:34.006+01:00	INFO: Local Directory: /tmp/dask-scratch-space/worker-ylr01t6j

2024-02-09T10:23:35.259+01:00	INFO: -------------------------------------------------

2024-02-09T10:23:35.260+01:00	INFO: Register worker <WorkerState 'inproc://169.255.254.1/1/4', name: 0, status: init, memory: 0, processing: 0>

2024-02-09T10:23:35.260+01:00	INFO: Starting worker compute stream, inproc://169.255.254.1/1/4

2024-02-09T10:23:35.260+01:00	INFO: Starting established connection to inproc://169.255.254.1/1/5

2024-02-09T10:23:35.260+01:00	INFO: Starting Worker plugin shuffle

2024-02-09T10:23:35.260+01:00	INFO: Registered to: inproc://169.255.254.1/1/1

2024-02-09T10:23:35.260+01:00	INFO: -------------------------------------------------

2024-02-09T10:23:35.260+01:00	INFO: Starting established connection to inproc://169.255.254.1/1/1

2024-02-09T10:23:35.260+01:00	INFO: Receive client connection: Client-e58dc3c4-c72c-11ee-8001-6efbcde7e649

2024-02-09T10:23:35.510+01:00	INFO: Starting established connection to inproc://169.255.254.1/1/6

2024-02-09T10:23:39.515+01:00	[I 2024-02-09 09:23:35.316 ServerApp] Skipped non-installed server(s): bash-language-server, dockerfile-language-

In the working images they have a URL that's configured correctly.

I am in VPC-only mode for the domain, but I don't see how that should change anything, since the sagemaker-distribution image works fine.

Would appreciate any pointers.

Bug: Unable to find libdevice directory (${CUDA_DIR}/nvvm/libdevice) in GPU Images, Fix: Install cuda-nvcc to GPU images and update XLA_FLAGS

Bug Description:

While running Keras tests against the GPU Docker image, tests were failing with the error: "Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice".

This issue has been discussed in several GitHub issues; one such issue: tensorflow/tensorflow#58681

How to fix it:

  • Install cuda-nvcc from nvidia channel
  • After installation of cuda-nvcc, update XLA_FLAGS Environment variable to specify the Cuda Data Directory.
    export XLA_FLAGS=--xla_gpu_cuda_data_dir=/opt/conda

What is the impact:

  • Any sagemaker-distribution user might run into the same issue while using the GPU images.

Ensure that integration tests don't change the underlying environment

Today, test_dockerfile_based_harness.py runs a bunch of integration tests against the images we build before those images are published to ECR.

As part of those tests, we may install a few dependencies via micromamba or pip, and that may modify the underlying package closure against which the tests run. So, after the tests run, we should assert that the original package closure hasn't changed in the test image (though new dependencies that come in through the test are okay).

Error when running JupyterLab on 1.0.0-beta

I'm seeing an error when running JupyterLab on the 1.0.0-beta container:

$ jupyter lab
[E 2023-10-19 19:12:48.722 ServerApp] Exception while loading config file /etc/jupyter/jupyter_server_config.py
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.10/site-packages/traitlets/config/application.py", line 915, in _load_config_files
        config = loader.load_config()
      File "/opt/conda/lib/python3.10/site-packages/traitlets/config/loader.py", line 627, in load_config
        self._read_file_as_dict()
      File "/opt/conda/lib/python3.10/site-packages/traitlets/config/loader.py", line 660, in _read_file_as_dict
        exec(compile(f.read(), conf_filename, "exec"), namespace, namespace)  # noqa
      File "/etc/jupyter/jupyter_server_config.py", line 4, in <module>
        from nb_conda_kernels import CondaKernelSpecManager
    ModuleNotFoundError: No module named 'nb_conda_kernels'
...

The server is still able to start, and extensions still appear to be getting loaded & initialized correctly.

I believe this error is due to changes to the default JL config in #76. The default Conda base environment in the image does not have nb_conda_kernels installed. It's also unclear how this dependency is being used, and whether it should be used at all, given that the package has not been updated in nearly 3 years.

When will this build with CUDA 12.0?

I see that new versions are still built with cuda 11.8.

'CUDA_MAJOR_MINOR_VERSION': '11.8', # Should match the previous one.
Do you have an estimate of when it's going to be built with the new cuda?

Installing any DL packages outside of the installed scope is becoming a pain because everything has to be pointed to +cu118 packages, which is quite toilsome, especially for data scientists.

Update stale botocore/boto3 versions

Currently boto3 is 1.28.64. The latest patch is 1.28.85, and latest overall release is 1.34.30.

This is due to amazon-sagemaker-jupyter-scheduler's pinning of aiobotocore ==2.7.*, which subsequently pins botocore to <1.28.65.

We need amazon-sagemaker-jupyter-scheduler to loosen upper-bound so that we can update boto3/botocore to newer versions.

Add tests for nodejs

Need further investigation and implementation of appropriate testing steps to validate the installation of the nodejs runtime

Resolve Pydantic to v2

Three packages are responsible for forcing Conda to resolve Pydantic to v1 instead of v2. We need to update both Jupyter AI & Scheduler, but amazon-sagemaker-jupyter-scheduler will also need to be updated to loosen the version constraint on Pydantic:

$ pipdeptree --packages pydantic --reverse --depth 1 -w silence
pydantic==1.10.13
├── amazon-sagemaker-jupyter-scheduler==3.0.2 [requires: pydantic==1.*]
├── confection==0.1.3 [requires: pydantic>=1.7.4,<3.0.0,!=1.8.1,!=1.8]
├── jupyter-ai==2.4.0 [requires: pydantic~=1.0]
├── jupyter-ai-magics==2.4.0 [requires: pydantic~=1.0]
├── jupyter-scheduler==2.3.0 [requires: pydantic~=1.10]
├── langchain==0.0.318 [requires: pydantic>=1,<3]
├── langsmith==0.0.60 [requires: pydantic>=1,<3]
├── openapi-schema-pydantic==1.2.4 [requires: pydantic>=1.8.2]
├── spacy==3.7.2 [requires: pydantic>=1.7.4,<3.0.0,!=1.8.1,!=1.8]
├── thinc==8.2.1 [requires: pydantic>=1.7.4,<3.0.0,!=1.8.1,!=1.8]
└── weasel==0.3.4 [requires: pydantic>=1.7.4,<3.0.0,!=1.8.1,!=1.8]

Add a CLI editor program

Today (in SageMaker Studio), it seems impossible to run git commit (without a -m message) from the terminal in SageMaker Distribution; it fails with the following error:

error: cannot run editor: No such file or directory
error: unable to start editor 'editor'

The cause seems to be that no CLI editor is installed by default in SageMaker Distribution: no emacs, no nano, no vim, nothing.

I won't get into the whole emacs vs. vim debate, because honestly I don't like using CLI editors and don't really care which one ships, but IMO there should be something installed so that git commit works out of the box.
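For what it's worth, the 'editor' in the error message is git's compiled-in fallback: git looks up GIT_EDITOR, then core.editor from git config, then VISUAL and EDITOR, before falling back. A simplified sketch of that chain (core.editor is omitted since it lives in git config rather than the environment, and nano is just an example of an editor the image could ship):

```shell
unset GIT_EDITOR VISUAL EDITOR
# With nothing set, git falls back to the literal `editor`, which does
# not exist in the image -- hence "unable to start editor 'editor'".
fallback="${GIT_EDITOR:-${VISUAL:-${EDITOR:-editor}}}"
echo "$fallback"   # editor
# Setting any of these (to an editor that is actually installed) fixes it:
EDITOR=nano
chosen="${GIT_EDITOR:-${VISUAL:-${EDITOR:-editor}}}"
echo "$chosen"     # nano
```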

Error when building 1.0.0-beta

I am seeing the following error when building the images. The images are still built and named, but an error is printed afterwards. Here is how I am building the image:

# convenience for SageMaker Distribution
export SMD_DEFAULT_VERSION="1.0.0-beta"
export SMD_CONDA_ENV="sagemaker-distribution"
function smdi-build() {
    local version="${1:-$SMD_DEFAULT_VERSION}"
    echo "Entering $SMD_CONDA_ENV Conda environment..."
    conda activate "$SMD_CONDA_ENV"
    echo "Building SMD image v$version..."
    python ./src/main.py build \
        --target-patch-version "$version" \
        --force \
        --skip-tests
}

Here is the terminal output:

% smdi-build
Entering sagemaker-distribution Conda environment...
Building SMD image v1.0.0-beta...
Successfully built an image with id: sha256:5fd94f2711351c8a29909e6c09e02a80a2ea80360e4c34d84a8f2c32b112fbe2
[WARN]: Generating CHANGELOG is skipped because 'source-version.txt' isn't found.
Successfully built an image with id: sha256:1de8571f9d9c798c66972891e16546996a45023218eebd94711d593cc8fbb703
[WARN]: Generating CHANGELOG is skipped because 'source-version.txt' isn't found.
Traceback (most recent call last):
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/version.py", line 43, in __call__
    return cls._cache_[arg]
           ~~~~~~~~~~~^^^^^
KeyError: '>=2.0.0,'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/cli/common.py", line 95, in arg2spec
    spec = MatchSpec(arg)
           ^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/match_spec.py", line 55, in __call__
    return super().__call__(**parsed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/match_spec.py", line 179, in __init__
    self._match_components = self._build_components(**kwargs)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/match_spec.py", line 414, in _build_components
    return frozendict(_make_component(key, value) for key, value in kwargs.items())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/_vendor/frozendict/__init__.py", line 21, in __init__
    self._dict = self.dict_cls(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/match_spec.py", line 414, in <genexpr>
    return frozendict(_make_component(key, value) for key, value in kwargs.items())
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/match_spec.py", line 428, in _make_component
    matcher = _implementors[field_name](value)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/version.py", line 45, in __call__
    val = cls._cache_[arg] = super().__call__(arg)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/version.py", line 516, in __init__
    vspec_str, matcher, is_exact = self.get_matcher(vspec)
                                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/version.py", line 521, in get_matcher
    vspec = treeify(vspec)
            ^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/version.py", line 380, in treeify
    apply_ops("(")
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/models/version.py", line 354, in apply_ops
    raise InvalidVersionSpec(spec_str, "cannot join single expression")
conda.exceptions.InvalidVersionSpec: Invalid version '>=2.0.0,': cannot join single expression

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local/home/dlq/workplace/sagemaker-distribution/./src/main.py", line 384, in <module>
    parse_args(get_arg_parser())
  File "/local/home/dlq/workplace/sagemaker-distribution/./src/main.py", line 380, in parse_args
    args.func(args)
  File "/local/home/dlq/workplace/sagemaker-distribution/./src/main.py", line 140, in build_images
    generate_release_notes(target_version)
  File "/local/home/dlq/workplace/sagemaker-distribution/src/release_notes_generator.py", line 51, in generate_release_notes
    image_type_package_metadata = _get_image_type_package_metadata(target_version_dir)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/home/dlq/workplace/sagemaker-distribution/src/release_notes_generator.py", line 42, in _get_image_type_package_metadata
    image_type_package_metadata[image_generator_config['image_type']] = _get_installed_packages(
                                                                        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/home/dlq/workplace/sagemaker-distribution/src/release_notes_generator.py", line 16, in _get_installed_packages
    required_packages_from_target = get_match_specs(
                                    ^^^^^^^^^^^^^^^^
  File "/local/home/dlq/workplace/sagemaker-distribution/src/utils.py", line 39, in get_match_specs
    assert len(requirement_spec.environment.dependencies) == 1
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda_env/specs/requirements.py", line 49, in environment
    return env.Environment(name=self.name, dependencies=dependencies)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda_env/env.py", line 237, in __init__
    self.dependencies = Dependencies(dependencies)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda_env/env.py", line 198, in __init__
    self.parse()
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda_env/env.py", line 210, in parse
    self["conda"].append(common.arg2spec(line))
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dlq/miniconda3/envs/sagemaker-distribution/lib/python3.11/site-packages/conda/cli/common.py", line 99, in arg2spec
    raise CondaValueError("invalid package specification: %s" % arg)
conda.exceptions.CondaValueError: invalid package specification: conda-forge::pytorch-gpu[version='>=2.0.0,']
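The spec at the bottom of the traceback is the actual clue: conda-forge::pytorch-gpu[version='>=2.0.0,'] carries a trailing comma, which conda's version parser rejects ("cannot join single expression"), presumably because a version string was joined with an empty second constraint somewhere in the build artifacts. A sketch of the cleanup (the sed expression is illustrative, not the project's actual fix):

```shell
# Reproduce the malformed spec and strip the trailing comma inside the
# version='...' field so conda can parse it.
spec="conda-forge::pytorch-gpu[version='>=2.0.0,']"
fixed=$(printf '%s' "$spec" | sed "s/,'\]/'\]/")
echo "$fixed"   # conda-forge::pytorch-gpu[version='>=2.0.0']
```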
