@PeterCarragher A few questions:
- Are you trying to build an image with `idna` installed to use in EMR Studio (interactive), or as part of your batch jobs?
- Which release of EMR are you using?
- What are you doing with the bundled `tar.gz`? Are you providing options to EMR to make use of it?

For reference, the AL2023 image is only compatible with EMR 7.x. If you're using it with EMR Studio, you need a customized image based off the EMR Serverless image. If you're using it with batch jobs, you need to provide the proper sparkSubmitParameters to copy/enable the virtualenv.
Let me know, happy to help and maybe even put together a little video. :)
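For concreteness, the sparkSubmitParameters in question typically point Spark at the packed archive and its interpreter. A sketch (the bucket name is hypothetical; the conf keys match what's used later in this thread):

```
--conf spark.archives=s3://<bucket>/pyspark_venv.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```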
from emr-serverless-samples.
Hi Eric, thanks for opening the issue - let me double-check this example to verify it still works.
@waltari2001 I just ran through the example step-by-step and it works for me. Some things for you to check:
- Did the `pyspark_ge.tar.gz` get bundled properly? e.g. if you run `tar tzvf pyspark_ge.tar.gz | grep great_expectations`, are there entries in the tar file?
- Check the EMR Serverless stderr and stdout logs and see if there's anything in there. In the `stderr` file, you should see lines like this:
```
24/01/08 17:43:52 INFO SparkContext: Added archive file:/tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz#environment at spark://[2600:1f14:2e15:a301:cd15:538d:7b16:decc]:46003/files/pyspark_ge.tar.gz with timestamp 1704735831138
24/01/08 17:43:52 INFO Utils: Copying /tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz to /tmp/spark-84fd23a2-89bc-4573-a253-1ee2ecc61486/pyspark_ge.tar.gz
24/01/08 17:43:52 INFO SparkContext: Unpacking an archive file:/tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz#environment from /tmp/spark-84fd23a2-89bc-4573-a253-1ee2ecc61486/pyspark_ge.tar.gz to /tmp/spark-69cc27ba-e1dc-4691-a6ce-bc504186703d/userFiles-85021b77-6d66-4148-858b-f34d9ed35489/environment
```
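The `tar tzvf` check can be rehearsed end-to-end with a dummy archive. Everything below is a stand-in (the `pyspark_ge_demo` name and paths are made up) just to show the pattern:

```shell
# Build a dummy site-packages tree and pack it the way venv-pack lays out
# an environment, then list the archive and filter for the dependency.
mkdir -p /tmp/demo_env/lib/python3.9/site-packages/great_expectations
touch /tmp/demo_env/lib/python3.9/site-packages/great_expectations/__init__.py
tar czf /tmp/pyspark_ge_demo.tar.gz -C /tmp/demo_env .

# A healthy bundle prints matching entries here; empty output means the
# dependency never made it into the archive.
tar tzf /tmp/pyspark_ge_demo.tar.gz | grep great_expectations
```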
- Yes, the bundle does include the `great_expectations` module.
- Here are the stdout & stderr outputs.

stderr:

```
Files s3://splunk-config-test/artifacts/pyspark/pyspark_ge.tar.gz#environment from /tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/pyspark_ge.tar.gz to /home/hadoop/environment
24/01/08 15:30:44 INFO ShutdownHookManager: Shutdown hook called
24/01/08 15:30:44 INFO ShutdownHookManager: Deleting directory /tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0
```

stdout:

```
Traceback (most recent call last):
  File "/tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/ge_profile.py", line 4, in <module>
    import great_expectations as ge
ModuleNotFoundError: No module named 'great_expectations'
```
I am also using emr-7.0.0 with this test.
Ah interesting...just tested 7.0.0 now and found that it doesn't work. I was on 6.14.0.
Investigating more. It could be because EMR 7 is using Amazon Linux 2023...may need to swap out the Docker base image.
OK, the image definitely needs to be updated to AL2023. That said, while I can submit the job and it starts running, it just hangs and the executors die. I'm unsure if this is a Great Expectations compatibility issue with Spark 3.5(?) or something else...I tried updating to the latest version of GE, but still experiencing the issue.
edit: I think it's user error - didn't configure the EMR Serverless application with networking and it's not in the same region as the source data
Let me know if that's relevant or if you were just trying to get dependencies working in general.
This is the Dockerfile I used:

```dockerfile
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base

RUN dnf install -y gcc python3 python3-devel

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
    great_expectations==0.18.7 \
    venv-pack==0.2.0

RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
```
Yup, needed to configure the application properly.
This works now with AL2023 as the base image on EMR 7.x.
I'll leave this open until I update the examples.
Hi @dacort, I'm trying to use this `Dockerfile.AL2023` as a blueprint for creating an environment that has certain packages installed (having run into the same `ModuleNotFoundError`s using the original Dockerfile in the repo).
Specifically:
```dockerfile
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base

RUN dnf install -y gcc python3 python3-devel

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
    botocore boto3 requests warcio idna \
    venv-pack==0.2.0

RUN mkdir /output && venv-pack -o /output/pyspark_venv.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_venv.tar.gz /outputs/
```
After running this (with `DOCKER_BUILDKIT=1 sudo docker build --output . .`) and inspecting the contents of the output `.tar.gz`, I find that it contains `bin/python`, `bin/python3`, and `bin/python3.9` binaries. However, none of these binaries have the required packages:
```
~/dev/cc/emr-serverless-samples/examples/pyspark/dependencies/outputs/bin$ ./python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import idna
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'idna'
```
`docker build` indicates that the packages have been installed:

```
[+] Building 144.7s (11/11) FINISHED                             docker:default
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 601B                                       0.0s
 => [internal] load .dockerignore                                          0.0s
 => => transferring context: 2B                                            0.0s
 => [internal] load metadata for public.ecr.aws/amazonlinux/amazonlinux:2023-minimal  1.5s
 => [base 1/6] FROM public.ecr.aws/amazonlinux/amazonlinux:2023-minimal@sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb  4.1s
 => => resolve public.ecr.aws/amazonlinux/amazonlinux:2023-minimal@sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb  0.0s
 => => sha256:e831127e1f042c58b74da069439cf1452efa4314ca19c69b8e376186aabcb714 35.06MB / 35.06MB  2.0s
 => => sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb 770B / 770B  0.0s
 => => sha256:fd9eb74c5472b7e4286b3ae4b3649b2c7eb8968684e3a8d9158241417ca813be 529B / 529B  0.0s
 => => sha256:40c4449cff5bdec9bf82d3929159e57174488d711a8a9350790b24b3cc0104f3 1.48kB / 1.48kB  0.0s
 => => extracting sha256:e831127e1f042c58b74da069439cf1452efa4314ca19c69b8e376186aabcb714  1.9s
 => [base 2/6] RUN dnf install -y gcc python3 python3-devel               37.4s
 => [base 3/6] RUN python3 -m venv /opt/venv                               3.6s
 => [base 4/6] RUN python3 -m pip install --upgrade pip && python3 -m pip install great_expectations==0.18.7 venv-pack==0.2.0  66.5s
 => [base 5/6] RUN python3 -m pip install botocore boto3 requests warcio idna  7.8s
 => [base 6/6] RUN mkdir /output && venv-pack -o /output/pyspark_venv.tar.gz  21.2s
 => [export 1/1] COPY --from=base /output/pyspark_venv.tar.gz /outputs/    1.2s
 => exporting to client directory                                          0.8s
 => => copying files 169.91MB
```
Any idea what the problem is? If I attempt to use the output archive in the intended fashion on EMR studio, I just get a ModuleNotFoundError.
@dacort thank you for the quick reply!
TL;DR: using the correct base image fixes the problem.
I am trying to get up and running on EMR Studio.
Application settings:
- Release: EMR 7.0.0
- Type: Spark
- Architecture: x86_64
Job settings (passing the `.tar.gz`):

```
--conf spark.submit.pyFiles=s3://cc-pyspark/sparkcc.py
--conf spark.archives=s3://cc-pyspark/pyspark_venv.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python
```
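As a side note, a minimal sketch of why the `PYSPARK_PYTHON` settings above matter (simplified; PySpark's real worker-launch logic is more involved, and the `python3` fallback name here is an assumption):

```python
import os

# EMR Serverless forwards the spark.emr-serverless.*Env.PYSPARK_PYTHON
# conf values into the container environment; Spark then launches Python
# workers with whatever interpreter that variable names. Pointing it at
# ./environment/bin/python makes executors use the unpacked virtualenv
# instead of the image's system python.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

# Simplified stand-in for Spark's interpreter lookup:
interpreter = os.environ.get("PYSPARK_PYTHON", "python3")
print(interpreter)  # → ./environment/bin/python
```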
stdout:

```
Jan 16, 2024 12:42:06 AM org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption
WARNING: spark.log4jHotPatch.enabled is set to true, but /usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not exist at the configured location
Files s3://cc-pyspark/pyspark_venv.tar.gz#environment from /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/pyspark_venv.tar.gz to /home/hadoop/environment
Files s3://cc-pyspark/sparkcc.py from /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/sparkcc.py to /home/hadoop/sparkcc.py
24/01/16 00:42:11 INFO ShutdownHookManager: Shutdown hook called
24/01/16 00:42:11 INFO ShutdownHookManager: Deleting directory /tmp/localPyFiles-3bbd171b-bb9c-42cc-aad9-66ff2dae7065
24/01/16 00:42:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca
```
stderr:

```
Traceback (most recent call last):
  File "/tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/wat_extract_links.py", line 1, in <module>
    import idna
ModuleNotFoundError: No module named 'idna'
```
After getting this error I figured the issue was in the step where I set up the archive as described in the README, so I debugged locally to see if the python binaries in the archive had these modules; I got the same import errors locally.
Following the URL you shared, I tried updating to an image that matches the EMR Studio version I'm using:
```dockerfile
FROM --platform=linux/x86_64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest AS base

USER root
RUN yum install -y gcc python3 python3-devel
```
Re-uploading to S3 and testing the EMR Spark job, it runs now. Thanks for the help!
However, testing the imports locally still fails.
For future reference, is there a way to test the python binaries locally?
Hm, there's something odd going on here.
It looks like `idna` is included by default in Python 3.9 (which is used in EMR 7.0.0), so it should work locally without even doing anything with the image. For example:
```
❯ docker run --rm -it --entrypoint /bin/bash public.ecr.aws/emr-serverless/spark/emr-7.0.0
bash-5.2$ python3 -c "import idna; print(idna.decode('xn--eckwd4c7c.xn--zckzah'))"
ドメイン.テスト
```
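On the "test locally" question more generally: one option is to unpack the archive and invoke its interpreter directly, the same way EMR does via `PYSPARK_PYTHON=./environment/bin/python`. The caveat is that a venv-pack archive is relocatable, not portable: the packed interpreter was built against the image's OS, so this is only reliable inside the same base image (e.g. via the `docker run` above). A self-contained sketch of the unpack-and-run pattern, using a plain venv as a stand-in for the real archive (all paths here are arbitrary):

```shell
# Build a throwaway venv as a stand-in for the venv-pack output
# (--without-pip keeps it fast and avoids needing ensurepip).
python3 -m venv --without-pip /tmp/demo_env
tar czf /tmp/demo_env.tar.gz -C /tmp/demo_env .

# Unpack it the way spark.archives=...tar.gz#environment does:
mkdir -p /tmp/environment
tar xzf /tmp/demo_env.tar.gz -C /tmp/environment

# Run the unpacked interpreter directly, as EMR does. This works here
# because the host libraries match; across mismatched OS images
# (AL2 vs AL2023) the same step is where imports start failing.
/tmp/environment/bin/python3 -c "print('ok')"
```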