Comments (9)
Hi @jrolstad! I'm guessing based on the error message that we don't have org.apache.hadoop.hive.serde2.JsonSerDe included by default with EMR Serverless. I think there are a couple of options here:
- Figure out which one is included - maybe org.apache.hive.hcatalog.data.JsonSerDe?
- Figure out which package provides that class and include it when submitting your job using --packages
I've poked around a little bit and it's not immediately clear to me which one to use. We might have org.openx.data.jsonserde.JsonSerDe, so go ahead and give that one a try too.
from emr-serverless-samples.
Thanks, that's what I was assuming. Can you tell me where I can find the available libraries for EMR Serverless so I can self-serve next time?
I was trying to find this info as well. :) I'd take a look at the default SerDes listed for Hive ( https://cwiki.apache.org/confluence/display/Hive/SerDe ) since those are what will be included with each EMR release.
Also, if you have access to an EMR cluster or the EMR on EKS container images, you can poke around for Hive jars for your specific version.
# Run bash in the EMR on EKS container image
docker run --rm -it 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.6.0 /bin/bash
# Find hive jars
find / -iname '*hive*.jar'
# Look for JSON serdes
jar tvf /usr/lib/spark/jars/hive-serde-2.3.9-amzn-1.jar | grep -i json
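If it's not obvious which jar to inspect, the same check can be run across every jar at once. This is only a rough sketch, not anything that ships with EMR: the function name find_class_in_jars is made up for illustration, and it relies only on the fact that a jar is a zip archive in which class a.b.C is stored as the entry a/b/C.class.

```python
import os
import zipfile


def find_class_in_jars(root, class_name):
    """Return the sorted paths of jars under `root` that contain `class_name`.

    Hypothetical helper for illustration only.
    """
    # A class a.b.C is stored inside a jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".jar"):
                continue
            jar_path = os.path.join(dirpath, filename)
            try:
                with zipfile.ZipFile(jar_path) as jar:
                    if entry in jar.namelist():
                        matches.append(jar_path)
            except (zipfile.BadZipFile, OSError):
                # Skip corrupt archives, broken symlinks and unreadable files
                continue
    return sorted(matches)
```

Inside the container, something like find_class_in_jars("/usr/lib", "org.apache.hadoop.hive.serde2.JsonSerDe") should list the jars that provide that class, assuming Python 3 is available in the image and the jars live under /usr/lib (both are assumptions on my part). An empty list means no jar under that directory provides the class.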
(correcting my previous comment after looking deeper at the documentation myself)
I think org.apache.hive.hcatalog.data.JsonSerDe moved to org.apache.hadoop.hive.serde2.JsonSerDe in Hive 3 (EMR 6.x), so give that a shot. I flipped the class names in my original comment.
I'll try to give this a try on my end as well.
@dacort Thanks for the update on the naming. I tried the org.apache.hadoop.hive.serde2.JsonSerDe value and still received the same result (java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.data.JsonSerDe not found), so I think there may be a version mismatch in one of the EMR Serverless libraries being used to run these jobs.
I'm not using a cluster or EKS (trying to go all serverless), so I'm unable to verify versions. Waiting to hear what you find on your end as well.
@dacort Let me know if you are able to verify on your side as well. If so, let me know where to log the issue for this, as using EMR Serverless with JSON data in an S3 bucket seems like a standard use case that should be addressed.
@jrolstad Just gave it a shot using your scripts linked above and it worked fine for me.
One thing I noticed is that you're using org.apache.hadoop.hive.serde2.JsonSerDe in the user_createtables.sql script, but your error message says that org.apache.hive.hcatalog.data.JsonSerDe is the class that's not found. If you ran the createtables script previously with the latter serde, you'll need to drop that table before running the script again. Hive on EMR Serverless uses the Glue Data Catalog, so you can either delete it in the Glue Console or add a DROP TABLE statement. This confused me as well, so I should make it more explicit in the README here.
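To make the drop-and-recreate step concrete, here is a minimal sketch that just builds the two HiveQL statements as strings. Everything specific in it is a placeholder rather than something from the actual user_createtables.sql script: the helper name recreate_table_statements, the table name users, the columns, and the S3 location are all hypothetical.

```python
def recreate_table_statements(table, columns, serde, location):
    """Build HiveQL that drops a stale table definition and recreates it.

    Hypothetical helper for illustration only.
    """
    column_sql = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return [
        # Dropping an EXTERNAL table removes only the catalog entry;
        # the underlying data files are left in place.
        f"DROP TABLE IF EXISTS `{table}`",
        f"CREATE EXTERNAL TABLE `{table}` ({column_sql}) "
        f"ROW FORMAT SERDE '{serde}' "
        f"LOCATION '{location}'",
    ]


# Placeholder values: table name, columns and bucket are assumptions
statements = recreate_table_statements(
    table="users",
    columns=[("id", "string"), ("name", "string")],
    serde="org.apache.hadoop.hive.serde2.JsonSerDe",
    location="s3://example-bucket/users/",
)
for statement in statements:
    print(statement + ";")
```

The point is only the ordering: the catalog keeps whatever serde the table was first created with, so the DROP has to run before the CREATE for the new class name to take effect.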
As an aside, you can run the EMR on EKS container image locally without having to use EKS. It's handy for when you want to have a local EMR environment, but is primarily geared towards Spark.
Dropping the table and recreating worked! Thanks for the help.
Sweet, thanks for following up! I'll add a note to the Hive section re: that specific SerDe.