Comments (10)

dacort commented on July 30, 2024

@PeterCarragher A few questions:

  • Are you trying to build an image with idna installed to use in EMR Studio (interactive) or as part of your batch jobs?
  • Which release of EMR are you using?
  • What are you doing with the bundled tar.gz? Are you providing options to EMR to make use of it?

For reference, the AL2023 image is only compatible with EMR 7.x. If you're using it with EMR Studio, you need to use a custom image based on the EMR Serverless image. If you're using it with batch jobs, you need to provide the proper sparkSubmitParameters to copy/enable the virtualenv.
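
Roughly, those parameters look like this (a sketch; the bucket and archive name are placeholders, and the same conf names appear later in this thread):

--conf spark.archives=s3://<your-bucket>/pyspark_ge.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python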

Let me know, happy to help and maybe even put together a little video. :)

dacort commented on July 30, 2024

Hi Eric, thanks for opening the issue - let me double-check this example to verify it still works.

dacort commented on July 30, 2024

@waltari2001 I just ran through the example step-by-step and it works for me. Some things for you to check:

  • Did the pyspark_ge.tar.gz get bundled properly? e.g. if you run tar tzvf pyspark_ge.tar.gz | grep great_expectations, are there entries in the tar file?
  • Check the EMR Serverless stderr and stdout logs and see if there's anything in there. In the stderr file, you should see lines like this:
24/01/08 17:43:52 INFO SparkContext: Added archive file:/tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz#environment at spark://[2600:1f14:2e15:a301:cd15:538d:7b16:decc]:46003/files/pyspark_ge.tar.gz with timestamp 1704735831138
24/01/08 17:43:52 INFO Utils: Copying /tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz to /tmp/spark-84fd23a2-89bc-4573-a253-1ee2ecc61486/pyspark_ge.tar.gz
24/01/08 17:43:52 INFO SparkContext: Unpacking an archive file:/tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz#environment from /tmp/spark-84fd23a2-89bc-4573-a253-1ee2ecc61486/pyspark_ge.tar.gz to /tmp/spark-69cc27ba-e1dc-4691-a6ce-bc504186703d/userFiles-85021b77-6d66-4148-858b-f34d9ed35489/environment

waltari2001 commented on July 30, 2024

  • Yes, the bundle does include the great_expectations module.
  • Here are the stdout & stderr outputs:

stderr:

Files s3://splunk-config-test/artifacts/pyspark/pyspark_ge.tar.gz#environment from /tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/pyspark_ge.tar.gz to /home/hadoop/environment
24/01/08 15:30:44 INFO ShutdownHookManager: Shutdown hook called
24/01/08 15:30:44 INFO ShutdownHookManager: Deleting directory /tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0

stdout:
Traceback (most recent call last):
  File "/tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/ge_profile.py", line 4, in <module>
    import great_expectations as ge
ModuleNotFoundError: No module named 'great_expectations'

I am also using emr-7.0.0 with this test.

dacort commented on July 30, 2024

Ah interesting...just tested 7.0.0 now and found that it doesn't work. I was on 6.14.0.

Investigating more. It could be because EMR 7 is using Amazon Linux 2023...may need to swap out the Docker base image.

dacort commented on July 30, 2024

OK, the image definitely needs to be updated to AL2023. That said, while I can submit the job and it starts running, it just hangs and the executors die. I'm unsure if this is a Great Expectations compatibility issue with Spark 3.5(?) or something else... I tried updating to the latest version of GE, but I'm still experiencing the issue.

edit: I think it's user error - I didn't configure the EMR Serverless application with networking, and it's not in the same region as the source data.
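
For anyone hitting the same hang, attaching VPC networking to the application looks roughly like this (a sketch; the application ID, subnet, and security group IDs are placeholders):

aws emr-serverless update-application \
    --application-id <application-id> \
    --network-configuration '{"subnetIds": ["subnet-xxxxxxxx"], "securityGroupIds": ["sg-xxxxxxxx"]}'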

Let me know if that's relevant or if you were just trying to get dependencies working in general.

This is the dockerfile I used:

# AL2023 matches the OS used by EMR 7.x
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base

# gcc/python3-devel are needed to compile any native wheels
RUN dnf install -y gcc python3 python3-devel

# Create a virtualenv and put it on PATH so pip installs into it
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
    great_expectations==0.18.7 \
    venv-pack==0.2.0

# Pack the virtualenv into a relocatable archive
RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz

# Export only the archive as the build output
FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
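
To produce the archive from this Dockerfile, a build command like the one used later in this thread should work (run from the directory containing the Dockerfile):

DOCKER_BUILDKIT=1 docker build --output . .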

dacort commented on July 30, 2024

Yup, needed to configure the application properly.

This works now with AL2023 as the base image on EMR 7.x.

I'll leave this open until I update the examples.

PeterCarragher commented on July 30, 2024

Hi @dacort, I'm trying to use this AL2023 Dockerfile as a blueprint for creating an environment that has certain packages installed (having run into the same ModuleNotFoundErrors using the original Dockerfile in the repo).
Specifically:

FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base
RUN dnf install -y gcc python3 python3-devel

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
    botocore boto3 requests warcio idna \
    venv-pack==0.2.0

RUN mkdir /output && venv-pack -o /output/pyspark_venv.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_venv.tar.gz /outputs/

After running this (with DOCKER_BUILDKIT=1 sudo docker build --output . .) and inspecting the contents of the output .tar.gz, I find that it contains bin/python, bin/python3 and bin/python3.9 binaries.
However, none of these binaries have the required packages:

~/dev/cc/emr-serverless-samples/examples/pyspark/dependencies/outputs/bin$ ./python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import idna
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'idna'

docker build indicates that the packages have been installed:

[+] Building 144.7s (11/11) FINISHED                                                                                                                   docker:default
 => [internal] load build definition from Dockerfile                                                                                                             0.0s
 => => transferring dockerfile: 601B                                                                                                                             0.0s
 => [internal] load .dockerignore                                                                                                                                0.0s
 => => transferring context: 2B                                                                                                                                  0.0s
 => [internal] load metadata for public.ecr.aws/amazonlinux/amazonlinux:2023-minimal                                                                             1.5s
 => [base 1/6] FROM public.ecr.aws/amazonlinux/amazonlinux:2023-minimal@sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb                  4.1s
 => => resolve public.ecr.aws/amazonlinux/amazonlinux:2023-minimal@sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb                       0.0s
 => => sha256:e831127e1f042c58b74da069439cf1452efa4314ca19c69b8e376186aabcb714 35.06MB / 35.06MB                                                                 2.0s
 => => sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb 770B / 770B                                                                       0.0s
 => => sha256:fd9eb74c5472b7e4286b3ae4b3649b2c7eb8968684e3a8d9158241417ca813be 529B / 529B                                                                       0.0s
 => => sha256:40c4449cff5bdec9bf82d3929159e57174488d711a8a9350790b24b3cc0104f3 1.48kB / 1.48kB                                                                   0.0s
 => => extracting sha256:e831127e1f042c58b74da069439cf1452efa4314ca19c69b8e376186aabcb714                                                                        1.9s
 => [base 2/6] RUN dnf install -y gcc python3 python3-devel                                                                                                     37.4s
 => [base 3/6] RUN python3 -m venv /opt/venv                                                                                                                     3.6s 
 => [base 4/6] RUN python3 -m pip install --upgrade pip &&     python3 -m pip install     great_expectations==0.18.7     venv-pack==0.2.0                       66.5s 
 => [base 5/6] RUN python3 -m pip install botocore boto3 requests warcio idna                                                                                    7.8s 
 => [base 6/6] RUN mkdir /output && venv-pack -o /output/pyspark_venv.tar.gz                                                                                    21.2s 
 => [export 1/1] COPY --from=base /output/pyspark_venv.tar.gz /outputs/                                                                                          1.2s 
 => exporting to client directory                                                                                                                                0.8s 
 => => copying files 169.91MB      

Any idea what the problem is? If I attempt to use the output archive in the intended fashion in EMR Studio, I just get a ModuleNotFoundError.

PeterCarragher commented on July 30, 2024

@dacort thank you for the quick reply!
TL;DR: using the correct base image fixes the problem.

I am trying to get up and running on EMR Studio.

Application settings:

  • Release: EMR 7.0.0
  • Type: Spark
  • Architecture: x86_64

Job settings (passing the .tar.gz):

--conf spark.submit.pyFiles=s3://cc-pyspark/sparkcc.py 
--conf spark.archives=s3://cc-pyspark/pyspark_venv.tar.gz#environment 
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python 
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python

stdout

Jan 16, 2024 12:42:06 AM org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption
WARNING: spark.log4jHotPatch.enabled is set to true, but /usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not exist at the configured location

Files s3://cc-pyspark/pyspark_venv.tar.gz#environment from /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/pyspark_venv.tar.gz to /home/hadoop/environment
Files s3://cc-pyspark/sparkcc.py from /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/sparkcc.py to /home/hadoop/sparkcc.py
24/01/16 00:42:11 INFO ShutdownHookManager: Shutdown hook called
24/01/16 00:42:11 INFO ShutdownHookManager: Deleting directory /tmp/localPyFiles-3bbd171b-bb9c-42cc-aad9-66ff2dae7065
24/01/16 00:42:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca

stderr

Traceback (most recent call last):
  File "/tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/wat_extract_links.py", line 1, in <module>
    import idna
ModuleNotFoundError: No module named 'idna'

After getting this error I figured the issue was in the step where I set up the archive as described in the readme. So I debugged locally to see if the python binaries in the archive had these modules; I got the same import errors locally.

Following the URL you shared, I tried updating to an image that matches the EMR Studio version I'm using:

FROM --platform=linux/x86_64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest AS base
USER root
RUN yum install -y gcc python3 python3-devel
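# (remaining steps presumably unchanged from the Dockerfile above)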

Re-uploading to S3 and testing the EMR Spark job, it runs now. Thanks for the help!

However, testing the imports locally still fails.
For future reference, is there a way to test the python binaries locally?

dacort commented on July 30, 2024

Hm, there's something odd going on here.

It looks like idna is included by default in the EMR 7.0.0 image (which ships Python 3.9). So it should work without even doing anything to the image. For example:

❯ docker run --rm -it --entrypoint /bin/bash public.ecr.aws/emr-serverless/spark/emr-7.0.0
bash-5.2$ python3 -c "import idna; print(idna.decode('xn--eckwd4c7c.xn--zckzah'))"
ドメイン.テスト
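
As for testing locally: the packed venv is tied to the Python version it was built with, so a host interpreter of a different version (3.10.12 in your output above) won't find the venv's lib/python3.9 packages. One way to test the archive (a sketch, assuming pyspark_venv.tar.gz is in the current directory) is to unpack and run it inside the same base image:

docker run --rm -it -v "$PWD":/work --entrypoint /bin/bash public.ecr.aws/emr-serverless/spark/emr-7.0.0
# then, inside the container:
mkdir /tmp/environment && tar -xzf /work/pyspark_venv.tar.gz -C /tmp/environment
/tmp/environment/bin/python -c "import idna; print('ok')"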
