This is a basic introduction to the CDE CLI. It is meant to be a quickstart for anyone starting from scratch.
For a more detailed resource or a reference for converting Spark Submits to CDE Spark Submits please visit this CDE CLI Demo.
This demo requires the following:
- A CDE Virtual Cluster in CDP Private or Public Cloud (Azure or AWS)
- Installing the CDE CLI on your local machine. Please visit these instructions for help.
- Basic familiarity with Spark, Python, and SQL will help. However, no modifications to the code are required.
- Preloading data to Cloud Storage is not required. The data will be created as part of the PySpark job.
Clone this GitHub repository to your local machine or download the files directly from GitHub.
To clone this GitHub repository, create and navigate to a local directory and then run the following command:
git clone https://github.com/pdefusco/CDE_CLI_Simple.git
On your local machine where you have installed the CLI, issue the following command.
cde --help
Review the output. Notice that there are eight high level command categories.
Available Commands:
airflow Airflow commands
backup Create and Restore CDE backups
credential Manage CDE credentials
help Help about any command
job Manage CDE jobs
resource Manage CDE resources
run Manage CDE runs
spark Spark commands
The airflow and spark commands allow you to trigger a CDE Airflow DAG or CDE Spark Submit respectively. They are useful for submitting a job to CDE with minimum work. For example, you may have developed a new Spark job and you would like to give it a quick test run. You can learn more by issuing:
cde airflow --help
or
cde spark --help
The job command allows you to create a CDE Job so you can run Airflow and Spark jobs. Why create a CDE Job if you can just use the previous commands? A CDE Job allows you to specify additional options, for example setting an execution schedule or listing previous runs. You can learn more by issuing:
cde job --help
The resource command allows you to create a CDE Resource. CDE Resources are used to upload files, dependencies, requirements and more to the CDE Virtual Cluster. For more on CDE Resources please visit this link Try:
cde resource --help
The run command allows you to describe, list and learn more about each execution. It can be used in the context of CI/CD, integrations with 3rd party tools, and Job Observability.
cde run --help
Issue the following command to run the local PySpark file in CDE. Notice this does not require the data being in cloud storage as a prerequisite.
cde spark submit sql.py
Create a CDE Resource of type file with the following command. The resource will be named "cde_cli_simple".
cde resource create --name cde_cli_simple
Load the PySpark file to the CDE Resource. The command assumes the sql.py file is located in your local directory.
cde resource upload --name cde_cli_simple --local-path "sql.py" --resource-path "sql.py"
Once the file is located in the CDE Resource you can create a CDE Job with it.
You can optionally add more options such as a recurrent schedule as shown below.
cde job create --name "sql_job" --type "spark" --application-file "sql.py" --cron-expression "0 */1 * * *" --schedule-enabled "true" --schedule-start "2022-08-09" --schedule-end "2022-10-02" --mount-1-resource "cde_cli_simple"
With the following command the Spark CDE Job will be triggered immediately in the CDE Virtual Cluster.
cde job run --name "sql_job" --application-file "sql.py"
Issue the following commands to query the CDE Virtual Cluster and obtain the CDE Job runs.
cde run list
Let's create a new CDE Job with log level of "DEBUG". Notice you could have also used the cde job update command to modify the existing job.
cde job create --name "cde_cli_job_custom_log_level" --type "spark" --application-file "sql.py" --log-level "DEBUG" --schedule-enabled "false" --mount-1-resource "cde_cli_simple"
Now run the job:
cde job run --name "cde_cli_job_custom_log_level" --application-file "sql.py"
When you search for the job you can add multiple filter conditions as shown below:
cde run list --filter 'status[like]%succeeded%' --filter 'spark.spec.file[eq]sql.py' --filter 'job[eq]cde_cli_job_custom_log_level'
The above command will print a dictionary. Take note of the id field at the top.
Finally, issue the following command to collect logs. Make sure to replace the id integer with the one corresponding to your job run.
cde run logs --type "driver/stdout" --id 358
CDE is the Cloudera Data Engineering Service, a containerized managed service for Spark and Airflow. Top benefits of using CDE include:
- Ability to pick Spark versions
- Apache Airflow Scheduling
- Enhanced Observability with Tuning, Job Analysis and Troubleshooting interfaces.
- Job Management via a CLI and an API.
If you are exploring CDE you may find the following tutorials relevant:
-
CDE CLI Demo: A more advanced CDE CLI reference with additional details for the CDE user who wants to move beyond the basics shown here.
-
Spark 3 & Iceberg: a quick intro of Time Travel Capabilities with Spark 3
-
GitLab2CDE: a CI/CD pipeline to orchestrate Cross Cluster Workflows - Hybrid/Multicloud Data Engineering
-
CML2CDE: a CI/CD Pipeline to deploy Spark ETL at Scale with Python and the CDE API
-
Postman2CDE: using the Postman API to bootstrap CDE Services