
amazon-emr-cli's Introduction

EMR CLI

So we're all working on data pipelines every day, but wouldn't it be nice to just hit a button and have our code automatically deployed to staging or test accounts? I thought so, too. That's why I created the EMR CLI (emr), which can help you package and deploy your EMR jobs so you don't have to.

The EMR CLI supports a wide variety of configuration options to adapt to your data pipeline, not the other way around.

  1. Packaging - Ensure a consistent approach to packaging your production Spark jobs.
  2. Deployment - Easily deploy your Spark jobs across multiple EMR environments or deployment frameworks like EC2, EKS, and Serverless.
  3. CI/CD - Easily test each iteration of your code without resorting to messy shell scripts. :)

The initial use cases are:

  1. Consistent packaging for PySpark projects.
  2. Use in CI/CD pipelines for packaging, deployment of artifacts, and integration testing.

Warning: This tool is still under active development, so commands may change until a stable 1.0 release is made.

Quick Start

You can use the EMR CLI to take a project from nothing to running on EMR Serverless in 2 steps.

First, let's install the emr command.

python3 -m pip install -U emr-cli

Note: This tutorial assumes you have already set up EMR Serverless and have an EMR Serverless application, job role, and S3 bucket you can use. If not, you can use the emr bootstrap command.

  1. Create a sample project
emr init scratch

📔 Tip: Use --project-type poetry to create a Poetry project!

You should now have a sample PySpark project in your scratch directory.

scratch
├── Dockerfile
├── entrypoint.py
├── jobs
│   └── extreme_weather.py
└── pyproject.toml

1 directory, 4 files
  2. Now deploy and run on an EMR Serverless application!
emr run \
    --entry-point entrypoint.py \
    --application-id ${APPLICATION_ID} \
    --job-role ${JOB_ROLE_ARN} \
    --s3-code-uri  s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
    --s3-logs-uri  s3://${S3_BUCKET}/logs/emr-cli-demo/ \
    --build \
    --show-stdout

This command performs the following actions:

  • Packages your project dependencies into a Python virtual environment
  • Uploads the Spark entrypoint and packaged dependencies to S3
  • Starts an EMR Serverless job
  • Waits for the job to run to completion and shows the stdout of the Spark driver when finished!

And you're done. Feel free to modify the project to experiment with different things. You can simply re-run the command above to re-package and re-deploy your job.

EMR CLI Sub-commands

The EMR CLI has several subcommands that you can see by running emr --help.

Commands:
  bootstrap  Bootstrap an EMR Serverless environment.
  deploy     Copy a local project to S3.
  init       Initialize a local PySpark project.
  package    Package a project and dependencies into dist/
  run        Run a project on EMR, optionally build and deploy
  status

bootstrap

emr bootstrap allows you to create a sample EMR Serverless or EMR on EC2 environment for testing. It assumes you have admin access and creates various resources for you using AWS APIs.

EMR Serverless

To bootstrap an EMR Serverless environment, use the following command:

emr bootstrap \
    --target emr-serverless \
    --code-bucket <your_unique_new_bucket_name> \
    --job-role-name <your_unique_emr_serverless_job_role_name>

When you do this, the CLI creates a new EMR CLI config file at .emr/config.yaml that sets default locations for your emr run command.
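
For example, once bootstrap has written the config file, a run only needs an entrypoint. A minimal sketch, assuming the bootstrapped application ID, job role, and S3 code location were saved as defaults:

emr run --entry-point entrypoint.py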

init

The init command creates a new pyproject.toml or Poetry project for you with a sample PySpark application.

init is required for those project types, as it also initializes the Dockerfile used to package your dependencies. Single-file PySpark jobs and simple Python modules do not require the init command.
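
For example (the project directory name is arbitrary):

emr init my-project
emr init --project-type poetry my-project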

package

The package command bundles your PySpark code and dependencies in preparation for deployment. Often you'll either use package and deploy to deploy new artifacts to S3, or you'll use the --build flag in the emr run command to handle both of those tasks for you.

The EMR CLI automatically detects what type of project you have and builds the necessary dependency packages.
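
For example, the following packages a project with a main.py entrypoint into dist/:

emr package --entry-point main.py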

deploy

The deploy command copies the project dependencies from the dist/ folder to your specified S3 location.
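
For example (bucket placeholder assumed):

emr deploy --entry-point main.py --s3-code-uri s3://<BUCKET>/code/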

run

The run command is intended to help package, deploy, and run your PySpark code across EMR on EC2, EMR on EKS, or EMR Serverless.

You must provide one of --cluster-id, --virtual-cluster-id, or --application-id to specify which environment to run your code on.
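
For example, each of the following targets a different environment (placeholders assumed, remaining flags elided):

emr run --entry-point main.py --application-id <app-id> ...       # EMR Serverless
emr run --entry-point main.py --cluster-id <cluster-id> ...       # EMR on EC2
emr run --entry-point main.py --virtual-cluster-id <vc-id> ...    # EMR on EKS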

emr run --help shows all the available options:

Usage: emr run [OPTIONS]

  Run a project on EMR, optionally build and deploy

Options:
  --application-id TEXT         EMR Serverless Application ID
  --cluster-id TEXT             EMR on EC2 Cluster ID
  --virtual-cluster-id TEXT     EMR on EKS Virtual Cluster ID
  --entry-point FILE            Python or Jar file for the main entrypoint
  --job-role TEXT               IAM Role ARN to use for the job execution
  --wait                        Wait for job to finish
  --s3-code-uri TEXT            Where to copy/run code artifacts to/from
  --s3-logs-uri TEXT            Where to send EMR Serverless logs to
  --job-name TEXT               The name of the job
  --job-args TEXT               Comma-delimited string of arguments to be
                                passed to Spark job

  --spark-submit-opts TEXT      String of spark-submit options
  --build                       Package and deploy job artifacts
  --show-stdout                 Show the stdout of the job after it's finished
  --save-config                 Update the config file with the provided
                                options

  --emr-eks-release-label TEXT  EMR on EKS release label (emr-6.15.0) -
                                defaults to latest release

Supported PySpark configurations

  • Single-file project - Projects that have a single .py entrypoint file.
  • Multi-file project - A more typical PySpark project, but without dependencies, that has multiple Python files or modules.
  • Python module - A project with dependencies defined in a pyproject.toml file.
  • Poetry project - A project using Poetry for dependency management.

Sample Commands

  • Create a new PySpark project (other frameworks TBD)
emr init project-dir
  • Package your project into a virtual environment archive
emr package --entry-point main.py

The EMR CLI auto-detects the project type and will change the packaging method appropriately.

If you have additional .py files, those will be included in the archive.

  • Deploy an existing package artifact to S3.
emr deploy --entry-point main.py --s3-code-uri s3://<BUCKET>/code/
  • Deploy a PySpark package to S3 and trigger an EMR Serverless job
emr run --entry-point main.py \
    --s3-code-uri s3://<BUCKET>/code/ \
    --application-id <EMR_SERVERLESS_APP> \
    --job-role <JOB_ROLE_ARN>
  • Build, deploy, and run an EMR Serverless job and wait for it to finish.
emr run --entry-point main.py \
    --s3-code-uri s3://<BUCKET>/code/ \
    --application-id <EMR_SERVERLESS_APP> \
    --job-role <JOB_ROLE_ARN> \
    --build \
    --wait
  • Re-run an already deployed job and show the stdout of the driver.
emr run --entry-point main.py \
    --s3-code-uri s3://<BUCKET>/code/ \
    --s3-logs-uri s3://<BUCKET>/logs/ \
    --application-id <EMR_SERVERLESS_APP> \
    --job-role <JOB_ROLE_ARN> \
    --show-stdout

Note: If the job fails, the command will exit with an error code.

  • Re-run your jobs with 7 characters.

If you provide the --save-config flag to emr run, it will save a configuration file for you in .emr/config.yaml, and next time you can use emr run with no parameters to re-run your job.

emr run --entry-point main.py \
    ... \
    --save-config

[emr-cli]: Config file saved to .emr/config.yaml. Use `emr run` to re-use your configuration.
❯ emr run
[emr-cli]: Using config file: .emr/config.yaml

🥳

  • Run the same job against an EMR on EC2 cluster
emr run --entry-point main.py \
    --s3-code-uri s3://<BUCKET>/code/ \
    --s3-logs-uri s3://<BUCKET>/logs/ \
    --cluster-id <EMR_EC2_CLUSTER_ID> \
    --show-stdout
  • Or an EMR on EKS virtual cluster.
emr run --entry-point main.py \
    --s3-code-uri s3://<BUCKET>/code/ \
    --s3-logs-uri s3://<BUCKET>/logs/ \
    --virtual-cluster-id <EMR_EKS_VIRTUAL_CLUSTER_ID> \
    --job-role <EMR_EKS_JOB_ROLE_ARN> \
    --show-stdout

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.


amazon-emr-cli's Issues

EMR CLI run throws error

Hello, I am using macOS, running a Docker container.


emr run \
--entry-point entrypoint.py \
--application-id <ID> \
--job-role arn:aws:iam::043916019468:role/AmazonEMR-ExecutionRole-1693489747108 \
--s3-code-uri s3://jXXXv/emr_scripts/ \
--spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"  \
--build \
--wait

Error

[emr-cli]: Packaging assets into dist/
[+] Building 23.2s (10/15)                                                                                                                                                                                docker:desktop-linux
 => [internal] load build definition from Dockerfile                                                                                                                                                                      0.0s
 => => transferring dockerfile: 2.76kB                                                                                                                                                                                    0.0s
 => [internal] load .dockerignore                                                                                                                                                                                         0.0s
 => => transferring context: 83B                                                                                                                                                                                          0.0s
 => [internal] load metadata for docker.io/library/amazonlinux:2                                                                                                                                                          0.6s
 => [auth] library/amazonlinux:pull token for registry-1.docker.io                                                                                                                                                        0.0s
 => [base 1/7] FROM docker.io/library/amazonlinux:2@sha256:e218c279c7954a94c6f4c8ab106c3ea389675d429feec903bc4c93fa66ed4fd0                                                                                               0.0s
 => [internal] load build context                                                                                                                                                                                         0.0s
 => => transferring context: 486B                                                                                                                                                                                         0.0s
 => CACHED [base 2/7] RUN yum install -y python3 tar gzip                                                                                                                                                                 0.0s
 => CACHED [base 3/7] RUN python3 -m venv /opt/venv                                                                                                                                                                       0.0s
 => CACHED [base 4/7] RUN python3 -m pip install --upgrade pip                                                                                                                                                            0.0s
 => ERROR [base 5/7] RUN curl -sSL https://install.python-poetry.org | python3 -                                                                                                                                         22.6s
------
 > [base 5/7] RUN curl -sSL https://install.python-poetry.org | python3 -:
22.53 Retrieving Poetry metadata
22.53 
22.53 # Welcome to Poetry!
22.53 
22.53 This will download and install the latest version of Poetry,
22.53 a dependency and package manager for Python.
22.53 
22.53 It will add the `poetry` command to Poetry's bin directory, located at:
22.53 
22.53 /root/.local/bin
22.53 
22.53 You can uninstall at any time by executing this script with the --uninstall option,
22.53 and these changes will be reverted.
22.53 
22.53 Installing Poetry (1.6.1)
22.53 Installing Poetry (1.6.1): Creating environment
22.53 Installing Poetry (1.6.1): Installing Poetry
22.53 Installing Poetry (1.6.1): An error occurred. Removing partial environment.
22.53 Poetry installation failed.
22.53 See /poetry-installer-error-ulqu13bo.log for error logs.
------
Dockerfile:32
--------------------
  30 |     
  31 |     RUN python3 -m pip install --upgrade pip
  32 | >>> RUN curl -sSL https://install.python-poetry.org | python3 -
  33 |     
  34 |     ENV PATH="$PATH:/root/.local/bin"
--------------------
ERROR: failed to solve: process "/bin/sh -c curl -sSL https://install.python-poetry.org | python3 -" did not complete successfully: exit code: 1
Traceback (most recent call last):
  File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/bin/emr", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context().obj, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/emr_cli/emr_cli.py", line 238, in run
    p.build()
  File "/Users/soumilnitinshah/IdeaProjects/DemoProject/venv/lib/python3.11/site-packages/emr_cli/packaging/python_project.py", line 56, in build

Add a build flag on deploy

Currently, the deploy command assumes that the project has already been built or packaged.

We should either (or both):

  • Add some safeguards to prevent deploying if the artifacts don't exist
  • Add a --build flag to the deploy command similar to how we have on the run command.

--entry-point in subfolder does not work

If the entrypoint file is in a subfolder locally, it attempts to use that subfolder in constructing the entrypoint path on S3.

We should either:

  • Truncate the folder when constructing the entrypoint
  • Upload the entrypoint to the same subfolder in S3

Add documentation for emr-cli config file

Today, the EMR CLI supports a config file for storing information related to your deployment environment (app ID, job role, etc.), but it's completely undocumented. We should add:

  • Documentation around the config file
  • Command-line flags to save/update the config file. For example, a plain emr run would be much easier than emr run with 9 different flags. 😆 So we could add a new option like emr run --save-config that would save the set of arguments you used so you can reuse them the next time. This is useful when you're using the EMR CLI to iteratively develop/run a job in a remote environment.

Support adding only extra Spark properties for JobRun

Hi there,

As I understand it, there is a config_overrides = {} piece of code that will always override all existing configs at the application level. Consequently, I assume that, as when cloning or starting a new job from the UI, only extra "Spark properties" should need to be provided for the job run.

From your point of view: is this currently supported? And if not, what's the easiest way to implement it? Perhaps an additional call to the GetApplication API to fetch that information. Ideally, I want to have some common-ground configuration and not have to worry about remembering to pass it every time.
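
One possible starting point, sketched here with the AWS CLI rather than the EMR CLI: fetch the application-level configuration first and merge it with the per-job overrides instead of replacing them. This assumes the application exposes its default configuration in the GetApplication response:

aws emr-serverless get-application \
    --application-id <app-id> \
    --query 'application.runtimeConfiguration'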

Sample job failed: Python: no such file or directory

[emr-cli]: Waiting for job to complete...
[emr-cli]: Job state is now: PENDING
[emr-cli]: Job state is now: SCHEDULED
[emr-cli]: Job state is now: RUNNING
[emr-cli]: Job state is now: FAILED
[emr-cli]: EMR Serverless job failed: Job failed, please check complete logs in configured logging destination. ExitCode: 1. Last few exceptions: Caused by: java.io.IOException: error=2, No such file or directory
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory...

Add support for `--show-logs` in cluster mode on EMR on EC2

With the recent --show-logs flag, we switch the deploy mode to client so that EMR steps can capture the driver stdout.

Unfortunately, client deploy mode doesn't work with additional archives provided via the --archives flag or the --conf spark.archives parameter. See https://issues.apache.org/jira/browse/SPARK-36088 for a related issue.

In order to support this for cluster mode, we'd need to parse the step stderr logs to retrieve the YARN application ID, then fetch the YARN application logs from S3.
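
A rough sketch of that flow, assuming the YARN application ID has already been parsed from the step's stderr and the usual EMR on EC2 S3 log layout:

aws s3 ls s3://<log-bucket>/<prefix>/<cluster-id>/containers/<yarn-application-id>/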

Add support for `--profile` flag for choosing AWS profile

Currently, all boto3 clients are instantiated without specifying the AWS profile name. Hence, the default profile will be used.

Adding the --profile flag would significantly improve the convenience of the CLI, because it is very common to switch profiles frequently during development (for example, changing the target environment between dev and prod by switching the AWS profile).
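
As a workaround in the meantime, the standard AWS_PROFILE environment variable can select the profile, since boto3's default session honors it:

AWS_PROFILE=dev emr run --entry-point main.py ...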

Add timestamps to [emr-cli] logs

Hi,

Would it be possible to get a basic understanding of how much time it took to proceed between stages, and when the job was initially started, by looking only at the logs in the terminal?

How to Enable Glue Hive MetaStore with EMR CLI

AWS has announced the AWS EMR CLI:

https://aws.amazon.com/blogs/big-data/build-deploy-and-run-spark-jobs-on-amazon-emr-with-the-open-source-emr-cli-tool/

I have tried it, and the CLI works great and simplifies submitting jobs. However, could you tell us how to enable the Glue Hive metastore when submitting a job via the CLI?


Here is a sample of how we are submitting jobs:

emr run --entry-point entrypoint.py \
    --application-id --job-role <arn> \
    --s3-code-uri s3:///emr_scripts/ \
    --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    --build \
    --wait

If you can kindly get back to us on this issue, that would be great 😃
@dacort
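
For reference, the run command in the first issue on this page enables the Glue Data Catalog by passing the Hive metastore client factory as a Spark property; a sketch with placeholders assumed:

emr run --entry-point entrypoint.py \
    --application-id <app-id> \
    --job-role <arn> \
    --s3-code-uri s3://<bucket>/emr_scripts/ \
    --spark-submit-opts "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" \
    --build \
    --wait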

Add support for local builds

Not everybody wants to use Docker to build artifacts, and in some cases, like in a CI pipeline, it may be undesirable.

We should add support for some sort of --local-build flag that, if the system is compatible (Amazon Linux 2 only?), runs the proper set of commands to bootstrap the build environment and package the artifacts.
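
A minimal sketch of what such a local build might run, mirroring the Docker-based flow; venv-pack is commonly used to package virtualenvs for EMR Serverless, but the exact steps here are an assumption, not current CLI behavior:

# assumes an Amazon Linux 2 host with python3 available
python3 -m venv build-env
source build-env/bin/activate
pip install . venv-pack
venv-pack -o dist/pyspark_deps.tar.gz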

Implement PyInstaller

It would be useful to be able to install the CLI without having a functional Python installation, and across different operating systems.

PyInstaller can be utilized for this.

In addition, this would provide an easy installation method for packaging the EMR CLI into the VS Code extension.
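
A minimal sketch, where <entry-script> is a hypothetical placeholder for the emr-cli console entrypoint:

pip install pyinstaller
# --onefile bundles everything into a single self-contained executable
pyinstaller --onefile --name emr <entry-script>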

Add --job-args and --spark-submit-opts for EMR on EC2

Add --job-args (the variable is never used) and --spark-submit-opts (the variable does not exist) for EMR on EC2.
Currently, --job-args and --spark-submit-opts work only for EMR Serverless:

    # application_id indicates EMR Serverless job
    if application_id is not None:
        # We require entry-point and job-role
        if entry_point is None or job_role is None:
            raise click.BadArgumentUsage(
                "--entry-point and --job-role are required if --application-id is used."
            )

        if job_args:
            job_args = job_args.split(",")
        emrs = EMRServerless(application_id, job_role, p)
        emrs.run_job(
            job_name, job_args, spark_submit_opts, wait, show_stdout, s3_logs_uri
        )

    # cluster_id indicates EMR on EC2 job
    if cluster_id is not None:
        if job_args:
            job_args = job_args.split(",")
        emr = EMREC2(cluster_id, p, job_role)
        emr.run_job(job_name, job_args, wait, show_stdout)  # add spark_submit_opts


Add the ability to start a local EMR dev environment

Currently, the amazon-emr-vscode-toolkit provides a way to write your code in an EMR environment on your local machine, leveraging the EMR on EKS container image with a devcontainer.

This is limited to IDEs that support devcontainers. It would be good to have a standalone implementation within the EMR CLI to start a local dev environment and mount the project as a volume in the Docker container of the EMR image.

This feature would be interesting for PyCharm, which offers a way to connect to a remote Jupyter server.

Finalize CLI API

The EMR CLI has been available for a while now, and through my own usage and that of others, we have a good idea of the final set of commands and subcommands that should be supported by the CLI.

Today, certain things are confusing:

  • There are three main commands: package, deploy, and run
    • emr package builds a local version of the assets, while emr run ... --build both packages and deploys the assets.
    • Do we need both package and deploy? Or can we simply have build and run?

Typically, I only use run ... --build, but in CI/CD pipelines both package and deploy can be useful: package if you want to move the assets yourself, and deploy if you want the CLI to do the copy for you in one step.

It would be useful to be able to chain these commands as opposed to providing parameters. For example:

emr build deploy run --entrypoint file.py ... would perform all of build, deploy, and run in that order. That said, some combinations don't make sense, so we should protect against scenarios like these:

  • If you already built your assets, you wouldn't repeatedly use deploy run.
  • You wouldn't use emr build run.

Add support for `--show-logs` on EMR Serverless

We recently added a new --show-logs flag for EMR on EC2, but this functionality is not yet supported on EMR Serverless.

In order to add this, we also need to add an --s3-log-uri flag for EMR Serverless. EMR on EC2 clusters have a common LogUri defined at cluster start that we fetch at runtime.
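
A sketch of fetching the driver stdout directly once S3 logging is configured; the key layout below follows the EMR Serverless docs but is an assumption here:

aws s3 cp s3://<log-bucket>/<prefix>/applications/<application-id>/jobs/<job-run-id>/SPARK_DRIVER/stdout.gz - | gunzip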
