Comments (9)

dacort avatar dacort commented on July 30, 2024

Hi @jrolstad! Based on the error message, I'm guessing org.apache.hadoop.hive.serde2.JsonSerDe isn't included by default with EMR Serverless. I think there are a couple of options here:

  • Figure out which one is included - maybe org.apache.hive.hcatalog.data.JsonSerDe?
  • Figure out which package provides that class and include it when submitting your job using --packages

I've poked around a little bit and it's not immediately clear to me which one to use. We might have org.openx.data.jsonserde.JsonSerDe so go ahead and give that one a try too.

from emr-serverless-samples.

jrolstad avatar jrolstad commented on July 30, 2024

Thanks, that's what I was assuming. Can you tell me where I can find the available libraries for EMR Serverless so I can self-serve next time?

dacort avatar dacort commented on July 30, 2024

I was trying to find this info as well. :) I'd take a look at the default SerDes listed for Hive ( https://cwiki.apache.org/confluence/display/Hive/SerDe ) since that's what will be included with each EMR release.

Also, if you have access to an EMR cluster or the EMR on EKS container images, you can poke around for Hive jars for your specific version.

# Run bash in the EMR on EKS container image
docker run --rm -it 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0 /bin/bash
# Find hive jars 
find / -iname '*hive*.jar'

# Look for JSON serdes
jar tvf /usr/lib/spark/jars/hive-serde-2.3.9-amzn-1.jar | grep -i json

dacort avatar dacort commented on July 30, 2024

(correcting my previous comment after looking deeper at the documentation myself)

I think org.apache.hive.hcatalog.data.JsonSerDe moved to org.apache.hadoop.hive.serde2.JsonSerDe in Hive 3 (EMR 6.x), so give that a shot. I flipped the class names in my original comment.

I'll try this on my end as well.

jrolstad avatar jrolstad commented on July 30, 2024

@dacort Thanks for the update on the naming. I tried the org.apache.hadoop.hive.serde2.JsonSerDe value and still received the same result (java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found), so I think there may be a version mismatch in one of the EMR Serverless libraries being used to run these jobs.

I'm not using a cluster or EKS (trying to go all serverless), so I'm unable to verify versions. Waiting to hear what you find on your end as well.

jrolstad avatar jrolstad commented on July 30, 2024

@dacort Let me know if you are able to verify on your side as well. If so, let me know where to log the issue, as using EMR Serverless with JSON data in an S3 bucket seems like a standard use case that should be addressed.

dacort avatar dacort commented on July 30, 2024

@jrolstad Just gave it a shot using your scripts linked above and it worked fine for me.

One thing I noticed is that you're using org.apache.hadoop.hive.serde2.JsonSerDe in the user_createtables.sql script, but your error message says that org.apache.hive.hcatalog.data.JsonSerDe is the class that's not found. If you ran the createtables script previously with the latter serde, you'll need to drop that table before running the script again. Hive on EMR Serverless uses the Glue Data Catalog, so you can either delete it in the Glue Console or add a DROP TABLE statement. This confused me as well, so I should make it more explicit in the README here.
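
As a rough sketch of the DROP TABLE approach, the helper below writes a Hive script that drops any stale definition before recreating the table with the newer SerDe class. The function name, table name, columns, and S3 location are all placeholders for illustration, not the contents of the actual user_createtables.sql script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a Hive script that drops a stale table definition
# before recreating it. Table name, columns, and S3 location are placeholders.
# Usage: write_create_table_script <output-file>
write_create_table_script() {
  local out_file="$1"
  cat > "$out_file" <<'SQL'
-- Remove the old catalog entry so the previously registered SerDe class is not reused.
DROP TABLE IF EXISTS example_users;

CREATE EXTERNAL TABLE example_users (
  id STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
LOCATION 's3://example-bucket/users/';
SQL
}
```

Running `write_create_table_script example_createtables.sql` produces a script whose first statement is the drop, so resubmitting it replaces whatever table definition is already registered in the Glue Data Catalog under that name.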

As an aside, you can run the EMR on EKS container image locally without having to use EKS. It's handy for when you want to have a local EMR environment, but is primarily geared towards Spark.

jrolstad avatar jrolstad commented on July 30, 2024

Dropping the table and recreating worked! Thanks for the help.

dacort avatar dacort commented on July 30, 2024

Sweet, thanks for following up! I'll add a note to the Hive section re: that specific SerDe.
