
aws-glue-libs's Introduction

aws-glue-libs

This repository contains Python libraries for local development of Glue PySpark batch jobs. Glue streaming is supported in the separate repository aws-glue-streaming-libs.

Contents

This repository contains:

  • awsglue - the Python library you can use to author AWS Glue ETL jobs. This library extends Apache Spark with additional data types and operations for ETL workflows and serves as the Python interface to the Glue ETL library. A minimal usage sketch follows this list.
  • bin - this directory hosts several executables that allow you to run the Python library locally or open up a PySpark shell to run Glue Spark code interactively.
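For orientation, here is a minimal sketch of the kind of ETL script the library is used to author. The database, table, and bucket names are hypothetical, and it assumes the local setup described below is complete and that AWS credentials and a region are configured.

# Minimal awsglue ETL sketch; database/table/bucket names are hypothetical.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping

sc = SparkContext()
glue_context = GlueContext(sc)

# Read a catalog table into a DynamicFrame, remap two columns, and write CSV to S3.
dyf = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[("id", "long", "id", "long"), ("name", "string", "name", "string")],
)
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output"},
    format="csv",
)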

Python versions by Glue Version

Different Glue versions support different Python versions. The following table lists, for each Glue version, the supported Python version and the corresponding aws-glue-libs branch.

Glue Version | Python 3 Version | aws-glue-libs branch
2.0          | 3.7              | glue-2.0
3.0          | 3.7              | glue-3.0
4.0          | 3.10             | master

You may refer to AWS Glue's official release notes for more information.

Setup guide

If you haven't already, please refer to the official AWS Glue Python local development documentation for the full setup instructions. The following is a summary of that documentation:

The awsglue library provides only the Python interface to the Glue Spark runtime; to run it locally you also need the Glue ETL jar. The jar is available via Maven from an S3-backed Maven repository. Here are the steps to set up your dev environment locally.

  1. Install Apache Maven from the following location: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
  2. Use the copy-dependencies target in Apache Maven to download the jar from S3 to your local dev environment.
  3. Download and extract the Apache Spark distribution based on the Glue version you're using:
    • Glue version 2.0: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-2.0/spark-2.4.3-bin-hadoop2.8.tgz
    • Glue version 3.0: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
    • Glue version 4.0: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-4.0/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0.tgz
  4. Export the SPARK_HOME environment variable to point to the location where you extracted the Spark distribution above. For example:
    Glue version 2.0: export SPARK_HOME=/home/$USER/spark-2.4.3-bin-hadoop2.8
    Glue version 3.0: export SPARK_HOME=/home/$USER/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3
    Glue version 4.0: export SPARK_HOME=/home/$USER/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0
    
  5. Now you can run the executables in the bin directory to start a Glue shell or submit a Glue Spark application.
    Glue shell: ./bin/gluepyspark
    Glue submit: ./bin/gluesparksubmit
    pytest: ./bin/gluepytest
    

(The gluepytest script assumes that the pytest module is installed and that the pytest executable is available on your PATH.)
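As a quick sanity check of the local setup, you can start the Glue shell and create a GlueContext from the SparkContext it exposes. The following is a minimal sketch, assuming ./bin/gluepyspark provides sc the way a standard PySpark shell does:

# Run inside ./bin/gluepyspark, which is assumed to expose a SparkContext as `sc`
# like a standard PySpark shell.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Wrap a tiny DataFrame as a DynamicFrame to confirm the Glue ETL jars are on the
# classpath; a misconfigured classpath typically fails here with
# "TypeError: 'JavaPackage' object is not callable" (see the issues below).
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
dyf = DynamicFrame.fromDF(df, glue_context, "smoke_test")
dyf.printSchema()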

Licensing

The libraries in this repository are licensed under the Amazon Software License (the "License"). They may not be used except in compliance with the License, a copy of which is included here in the LICENSE file.


Release Notes

July 26 2023

August 27 2021

  • The master branch has been switched from representing Glue 0.9 to Glue 3.0. We have also created a glue-0.9 branch to preserve the former state of the master branch with Glue 0.9. To rename your local clone of the older master branch and point it at the glue-0.9 branch, you may use the following commands:
git branch -m master glue-0.9
git fetch origin
git branch -u origin/glue-0.9 glue-0.9
git remote set-head origin -a

aws-glue-libs's People

Contributors

jpeddicord, mbeacom, moomindani, neilagupta, skycmoon, svajiraya, vaviliv


aws-glue-libs's Issues

Unable to Build AWSGlueETLPython 1.0.0 with maven when I run gluepyspark command on terminal.

[WARNING] The POM for com.amazonaws:AWSSDKGlueJavaClient:jar:1.0 is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.509 s
[INFO] Finished at: 2020-05-06T02:08:18+03:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project AWSGlueETLPython: Could not resolve dependencies for project com.amazonaws:AWSGlueETLPython:jar:1.0.0: Failure to find com.amazonaws:AWSSDKGlueJavaClient:jar:1.0 in https://aws-glue-etl-artifacts.s3.amazonaws.com/release/ was cached in the local repository, resolution will not be reattempted until the update interval of aws-glue-etl-artifacts has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

GluePySpark errors when trying to run with Spark 2.4.3

I keep getting the following error message when trying to use gluepyspark with Spark 2.4.3 + Hadoop 2.8 (the version listed in the README.)

Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jovyan/aws-glue-libs/jars/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jovyan/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/09/24 05:35:01 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
        at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
        at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
        at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "/home/jovyan/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/shell.py", line 38, in <module>
    SparkContext._ensure_initialized()
  File "/home/jovyan/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/home/jovyan/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/home/jovyan/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

Edit: sorry, the above is the wrong log - will update soon.
Edit2: updated!

Long type field converts to Null type while join operation

I have a DynamicFrame based on an RDS table with a bigint field, which converts to long when the DynamicFrame instance is created.

>>> opp = glueContext.create_dynamic_frame.from_catalog(
    database=SOURCE_DATABASE,
    table_name=SOURCE_OPPS_TABLE)
>>> opp.printSchema()
root
...
|-- OppID: int
|-- schedule_id: long
...

>>> present = glueContext.create_dynamic_frame.from_catalog(
         database=SOURCE_DATABASE,
         table_name=SOURCE_PRESENT_TABLE
    ).rename_field("OppID", "present_oppid")

opp_present = Join.apply(opp, present, "OppID", "present_oppid")
>>> opp_present.printSchema()
root
...
|-- OppID: int
|-- schedule_id: null
...

After joining with the other DynamicFrame, schedule_id converts to the null type. The DynamicFrame opp_present has NO other field with the same name.

Is this repository unmaintained?

Hi, it looks like this repository cannot keep up to date with the latest version of Glue, which makes it very difficult to implement tests and new functionality. AWS Glue is by far the most incomplete AWS product I've come across :(.

There are PRs waiting for review, and issues are piling up.
Commits are made by people without an associated GitHub account, so it's not possible to tag them.
Also, there are only 2 listed contributors, while the last commits were made by someone else (GitHub account issues).

Plus it looks really disconnected from https://github.com/aws-samples/aws-glue-samples

@jpeddicord
@hyandell
@moomindani

Unable to Create GlueContext via GlueContext Function in Local Python/awsglue Environment

I'm having the issue described in issue #42.

I am attempting to run the following in my local PySpark console...

from awsglue.context import GlueContext
glueContext = GlueContext(sc)

We receive the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\XYZ\bin\aws-glue-libs\PyGlue.zip\awsglue\context.py", line 47, in __init__
  File "C:\Users\XYZ\bin\aws-glue-libs\PyGlue.zip\awsglue\context.py", line 68, in _get_glue_scala_context
TypeError: 'JavaPackage' object is not callable

The following is the complete picture: [screenshot attached in the original issue]

The environment looks like the following:

  • OS: 10.0.17134.0
  • Python: 3.7.3
  • Hadoop (winutils.exe): 2.8.5
  • Spark: 2.4.3
  • PySpark: 2.4.6
  • awsglue: 1.0

My environment variables look like the following...

  • SPARK_HOME: \bin\spark-2.4.3-bin-hadoop2.8\
  • SPARK_CONF_DIR: \bin\aws-glue-libs\conf\
  • HADOOP_HOME: \bin\hadoop-2.8.5\
  • SPARK_CONF_DIR: \bin\spark-2.4.3-bin-hadoop2.8\
  • JAVA_HOME: C:\Progra~2\Java\jdk1.8.0\
  • CLASSPATH:
    • \bin\aws-glue-libs\jarsv1*
    • \bin\spark-2.4.3-bin-hadoop2.8\jars*
  • PYTHONPATH:
    • ${SPARK_HOME}\python\lib\py4j
    • \bin\aws-glue-libs\PyGlue.zip

Just to confirm which version of the awsglue repo I'm working with: [screenshot attached in the original issue]

The following are the "netty" files in my ..\aws-glue-libs\jarsv1\: [screenshot attached in the original issue]

I'm looking for a little guidance on how to tweak my configuration to resolve this issue.

unable to load aws credential with dynamodb service endpoint

Hello,
While trying to access a crawled DynamoDB table using aws-glue-libs, I get this error:

WARN DynamoDBFibonacciRetryer: Retry: 2 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]

Where should I put my user's credentials, or what ACL should I set on the table xxxxxxxxxxxxxxxxx so that it can be accessed by the library?
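For reference: the provider chain shown in this log also checks the standard AWS environment variables and the shared ~/.aws/credentials file, not only the instance profile. Below is a hedged sketch of one way to supply credentials when running locally; the profile and region names are hypothetical, and it assumes the driver JVM inherits the environment of the Python process that launches it, which is the usual behavior for a local PySpark launch.

import os

# Hypothetical profile/region; the default AWS credential chain also reads
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY and ~/.aws/credentials.
os.environ["AWS_PROFILE"] = "my-glue-dev"
os.environ["AWS_REGION"] = "eu-west-1"

# Create the contexts only after the environment is set, so the driver JVM
# launched from this Python process inherits these variables.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glue_context = GlueContext(sc)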

Below are the extended logs:


19/12/19 09:49:55 INFO DynamoDBUtil: Using endpoint for DynamoDB: dynamodb.eu-west-1.amazonaws.com
19/12/19 09:49:55 INFO deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/12/19 09:49:55 INFO JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
19/12/19 09:49:55 INFO ReadIopsCalculator: Table name: xxxxxxxxxxxxxxxxx
19/12/19 09:49:55 INFO ReadIopsCalculator: Throughput percent: 0.5
19/12/19 09:49:57 WARN DynamoDBFibonacciRetryer: Retry: 1 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:49:59 WARN DynamoDBFibonacciRetryer: Retry: 2 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:01 WARN DynamoDBFibonacciRetryer: Retry: 3 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:03 WARN DynamoDBFibonacciRetryer: Retry: 4 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:06 WARN DynamoDBFibonacciRetryer: Retry: 5 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:09 WARN DynamoDBFibonacciRetryer: Retry: 6 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:12 WARN DynamoDBFibonacciRetryer: Retry: 7 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:16 WARN DynamoDBFibonacciRetryer: Retry: 8 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:21 WARN DynamoDBFibonacciRetryer: Retry: 9 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:27 WARN DynamoDBFibonacciRetryer: Retry: 10 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:37 WARN DynamoDBFibonacciRetryer: Retry: 11 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:50:53 WARN DynamoDBFibonacciRetryer: Retry: 12 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:51:08 WARN DynamoDBFibonacciRetryer: Retry: 13 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:51:36 WARN DynamoDBFibonacciRetryer: Retry: 14 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:52:34 WARN DynamoDBFibonacciRetryer: Retry: 15 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:53:57 WARN DynamoDBFibonacciRetryer: Retry: 16 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:56:36 WARN DynamoDBFibonacciRetryer: Retry: 17 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:59:50 WARN DynamoDBFibonacciRetryer: Retry: 18 Exception: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:59:57 ERROR DynamoDBFibonacciRetryer: Retries exceeded or non-retryable exception, throwing: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
19/12/19 09:59:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Could not lookup table xxxxxxxxxxxxxxxxx in DynamoDB.
        at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:131)
        at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
        at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:58)
        at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:152)
        at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
        at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
        at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
        at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:132)
        at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:508)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
        at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
        at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
        at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:120)
        ... 20 more
Caused by: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
        at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:136)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1225)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:801)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:751)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4230)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4197)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:1885)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:1852)
        at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:124)
        at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:121)
        at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
        ... 21 more
19/12/19 09:59:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.RuntimeException: Could not lookup table xxxxxxxxxxxxxx in DynamoDB.
        at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:131)
        at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
        at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:58)
        at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:152)
        at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
        at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
        at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
        at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:132)
        at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:508)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
        at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
        at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
        at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:120)
        ... 20 more
Caused by: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
        at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:136)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1225)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:801)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:751)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4230)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4197)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:1885)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:1852)
        at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:124)
        at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:121)
        at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
        ... 21 more

19/12/19 09:59:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
19/12/19 09:59:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/12/19 09:59:57 INFO TaskSchedulerImpl: Cancelling stage 0
19/12/19 09:59:57 INFO DAGScheduler: ShuffleMapStage 0 (repartition at PartitioningStrategy.scala:29) failed in 602.545 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.RuntimeException: Could not lookup table xxxxxxxxxxxxxx in DynamoDB.
        at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:131)
        at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
        at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:58)
        at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:152)
        at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
        at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
        at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
        at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:132)
        at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:508)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
        at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
        at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
        at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:120)
        ... 20 more
Caused by: com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazonaws.auth.InstanceProfileCredentialsProvider@37ab3aad: Unable to load credentials from service endpoint]
        at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:136)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1225)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:801)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:751)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:4230)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:4197)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeDescribeTable(AmazonDynamoDBClient.java:1885)
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:1852)
        at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:124)
        at org.apache.hadoop.dynamodb.DynamoDBClient$1.call(DynamoDBClient.java:121)
        at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:80)
        ... 21 more

How to install the library to local machine?

Hi, I started using Python just a few weeks ago. I am trying to install the awsglue library on my local machine to write scripts for the AWS Glue service. However, I did not find a setup.py file in the awsglue directory. I tried several ways, but none of them works.

install using pip3: pip3 install git+https://github.com/awslabs/aws-glue-libs.git
result: Collecting git+https://github.com/awslabs/aws-glue-libs.git
Cloning https://github.com/awslabs/aws-glue-libs.git to /tmp/pip-6gdpim84-build
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/tokenize.py", line 454, in open
    buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-6gdpim84-build/setup.py'

import in pycharm: import sys
sys.path.append("/home/frank/extra_library/aws-glue-libs")
import awsglue

result: /usr/bin/python3.5 /home/frank/PycharmProjects/kyc/read_psql_table.py
Traceback (most recent call last):
  File "/home/frank/PycharmProjects/kyc/read_psql_table.py", line 4, in <module>
    import awsglue
  File "/home/frank/extra_library/aws-glue-libs/awsglue/__init__.py", line 13, in <module>
    from dynamicframe import DynamicFrame
ImportError: No module named 'dynamicframe'

I know this could be a naive question, but could you please give me some pointers?

Filter class is missing

I see that this library hasn't been updated in a while, which is a shame because I find it really useful.
The Filter class is missing from the lib entirely.

preactions or push_down_predicate when creating DynamicFrame from JDBC/Redshift Connection/Catalog Source?

Hi,

I was curious whether it is possible to supply the equivalent of preactions OR push_down_predicate for a Redshift catalog table when creating a dynamic frame. Preactions and postactions currently exist when writing your Glue DynamicFrame to a Redshift or JDBC connection (see link above), but is there similar functionality when creating a DynamicFrame?

Please feel free to redirect me to an appropriate place to ask this question if this medium is inappropriate (apologies in advance if this is the case)!
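Not an authoritative answer, but for reference here is a hedged sketch of the two documented mechanisms mentioned above; the database, table, connection, and bucket names are hypothetical. Per the AWS documentation, push_down_predicate on create_dynamic_frame.from_catalog prunes partitioned, S3-backed catalog tables rather than filtering JDBC/Redshift sources, while preactions/postactions are connection_options for Redshift writes.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# push_down_predicate is a documented parameter of from_catalog, but it prunes
# S3 partitions; it is not a JDBC/Redshift filter.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_partitioned_s3_table",
    push_down_predicate="year == '2020' and month == '01'",
)

# preactions/postactions are documented connection_options when *writing* to Redshift.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my_redshift_connection",
    connection_options={
        "dbtable": "public.target_table",
        "database": "dev",
        "preactions": "TRUNCATE TABLE public.target_table;",
    },
    redshift_tmp_dir="s3://my-bucket/tmp/",
)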

create_dynamic_frame.from_catalog choke on double quote

I have a pipe delimited file that contains a field that is prefixed with a " like so:

TEST|611|"National Information Systems||Test_Last

This loads into glue and is queryable by Athena. I want to create a job that converts these files into parquet. When I do that, the job runs for several hours before ultimately failing. On a similar file without the double quote, the job runs in 9 minutes.

I hooked up a dev endpoint and fired up Zeppelin to confirm that the job hangs at glueContext.create_dynamic_frame.from_catalog(database = "test_db", table_name = "test_table") when the file with that double quote exists in S3.

I'd rather not have to clean out double quotes, especially since Athena can read this file just fine. I don't see a way to pass SerDe options to create_dynamic_frame.from_catalog, which would be super helpful. Or, just like #1, it'd be nice if this method used the schema and parsing specified by Glue instead of recrawling the data.

By the way, are the Scala Glue libraries available as open source anywhere? My ability to contribute to this project is limited by the interface exposed there.
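Not a fix for the catalog path, but as a hedged workaround sketch: reading the files directly with create_dynamic_frame.from_options lets you pass CSV parsing options such as separator and quoteChar, which the from_catalog path does not expose. The paths are hypothetical, and the quoteChar value of -1 to disable quoting follows the Glue format-options documentation.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glue_context = GlueContext(sc)

# Read the pipe-delimited files directly from S3 with explicit CSV options,
# bypassing the catalog table's SerDe (path is hypothetical).
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={
        "separator": "|",
        "quoteChar": -1,   # documented value for turning quoting off entirely
        "withHeader": False,
    },
)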

Can't start gluepyspark without manually changing dependencies

I get the error below, but I can get around it by commenting out the "mvn" line in glue-setup.sh and then replacing jarsv1/netty-all-4.0.23.Final.jar with netty-all-4.1.7.Final.jar (which I copied from the Spark dependencies).

19/11/07 08:29:40 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
/usr/local/spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/shell.py", line 41, in <module>
    spark = SparkSession._create_shell_session()
  File "/usr/local/spark/python/pyspark/sql/session.py", line 583, in _create_shell_session
    return SparkSession.builder.getOrCreate()
  File "/usr/local/spark/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/usr/local/spark/python/pyspark/context.py", line 367, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/usr/local/spark/python/pyspark/context.py", line 136, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/spark/python/pyspark/context.py", line 198, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/usr/local/spark/python/pyspark/context.py", line 306, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.AbstractMethodError: io.netty.util.concurrent.MultithreadEventExecutorGroup.newChild(Ljava/util/concurrent/Executor;[Ljava/lang/Object;)Lio/netty/util/concurrent/EventExecutor;
at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:84)
at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:58)
at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:47)
at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:50)
at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:102)
at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:71)
at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:461)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:57)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:249)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

S3 Intelligent Tiering support

Hello -
Curious whether the AWS Glue package supports writing frames to an S3 sink with Intelligent-Tiering support.

What's missing is an exhaustive list of the connection_options supported by the Glue package for the S3 sink: does it include a STORAGE_CLASS property, and if so, what is its syntax?

Thank you in advance!

Br,
Ram

Interact with s3 / catalog offline?

Hello!

I'd like to develop AWS Glue scripts locally without using the development endpoint (for a series of reasons). I'm trying to execute a simple script, like the following:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ETL body start
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "tests", table_name = "simple_table", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("timestamp", "string", "timestamp", "string"), ("colA", "string", "colB", "string")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://my-bucket/output"}, format = "csv", transformation_ctx = "datasink2")
# ETL body end

job.commit()

Now I try to execute it via gluesparksubmit, but it gives me an error about timing out:

20/06/30 15:49:54 WARN EC2MetadataUtils: Unable to retrieve the requested metadata (/latest/dynamic/instance-identity/document). Failed to connect to service endpoint: 
com.amazonaws.SdkClientException: Failed to connect to service endpoint: 
        at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:100)
        at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:70)
        at com.amazonaws.internal.InstanceMetadataServiceResourceFetcher.readResource(InstanceMetadataServiceResourceFetcher.java:75)
        at com.amazonaws.internal.EC2ResourceFetcher.readResource(EC2ResourceFetcher.java:66)
        at com.amazonaws.util.EC2MetadataUtils.getItems(EC2MetadataUtils.java:402)
        at com.amazonaws.util.EC2MetadataUtils.getData(EC2MetadataUtils.java:371)
        at com.amazonaws.util.EC2MetadataUtils.getData(EC2MetadataUtils.java:367)
        at com.amazonaws.util.EC2MetadataUtils.getEC2InstanceRegion(EC2MetadataUtils.java:282)
        at com.amazonaws.regions.InstanceMetadataRegionProvider.tryDetectRegion(InstanceMetadataRegionProvider.java:59)
        at com.amazonaws.regions.InstanceMetadataRegionProvider.getRegion(InstanceMetadataRegionProvider.java:50)
        at com.amazonaws.regions.AwsRegionProviderChain.getRegion(AwsRegionProviderChain.java:46)
        at com.amazonaws.services.glue.util.EndpointConfig$.getConfig(EndpointConfig.scala:42)
        at com.amazonaws.services.glue.util.Job$.init(Job.scala:75)
        at com.amazonaws.services.glue.util.Job.init(Job.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:607)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
        at sun.net.www.http.HttpClient.New(HttpClient.java:339)
        at sun.net.www.http.HttpClient.New(HttpClient.java:357)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1205)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
        at com.amazonaws.internal.ConnectionUtils.connectToEndpoint(ConnectionUtils.java:52)
        at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:80)
        ... 24 more
Traceback (most recent call last):
  File "/glue/./scripts/glue_date_convert.py", line 17, in <module>
    job.init(args['JOB_NAME'], args)
  File "/glue/aws-glue-libs/PyGlue.zip/awsglue/job.py", line 38, in init
  File "/glue/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/glue/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/glue/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.amazonaws.services.glue.util.Job.init.
: com.amazonaws.SdkClientException: Unable to load region information from any provider in the chain
        at com.amazonaws.regions.AwsRegionProviderChain.getRegion(AwsRegionProviderChain.java:59)
        at com.amazonaws.services.glue.util.EndpointConfig$.getConfig(EndpointConfig.scala:42)
        at com.amazonaws.services.glue.util.Job$.init(Job.scala:75)
        at com.amazonaws.services.glue.util.Job.init(Job.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

How am I supposed to make this work?

Also, can I use MinIO as a local S3 endpoint in order to avoid accessing the AWS services for local development? What about the AWS Glue Catalog?

Thank you

No module named 'dynamicframe' Only in Windows

I'm trying to import the GlueContext in the gluepyspark shell using the line below:
from awsglue.context import GlueContext

But the following error persists:


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/c/Downloads/glue/aws-glue-libs-master/awsglue/__init__.py", line 13, in <module>
    from dynamicframe import DynamicFrame
ModuleNotFoundError: No module named 'dynamicframe'

possible to use glue catalog ?

Hi,

Is it possible to use aws-glue-libs with the Glue catalog?
I was able to read a DataFrame from the Glue catalog.
However, I could not find a way to query the catalog from spark.sql, for example:

spark.sql("show databases") # returns default
sparl.sql("use mydatabasefromawsglue") # pyspark.sql.utils.AnalysisException: u"Database 'mydatabasefromawsglue' not found;"

Any clue ?

ResolveOption class docs state you can use action type Cast. Glue Context doesn't accept that type in convert_resolve_option method

Hello, is this part of the documentation wrong?

Please specify also target type if you choose Project and Cast action type.

It says I can use "cast" as an action type, but the convert_resolve_option method in the GlueContext here doesn't accept it as a valid parameter.

def convert_resolve_option(self, path, action, target):

Thanks
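For context, the documented user-facing home of the cast action is DynamicFrame.resolveChoice; whether convert_resolve_option should also accept it is the open question in this issue. A minimal sketch follows, with hypothetical database, table, and column names.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glue_context = GlueContext(sc)

# Hypothetical catalog source; any DynamicFrame with an ambiguous column works here.
dyf = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# The documented "cast" action: force schedule_id to long.
resolved = dyf.resolveChoice(specs=[("schedule_id", "cast:long")])
resolved.printSchema()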

Create dynamicframe programmatically

For testing purposes I would like to create a dynamic frame for a specific test case.

I can create a DataFrame and then use fromDF() to create one, but that also means a schema conforming to all rows will exist, so it won't really match the messy data I have in production.

The dynamic frame initialization seems to be hidden in a Java file somewhere, so reverse engineering this is cumbersome.

Creating one from an array of dicts would be a nice feature, i.e.:

DynamicFrame([{"id": 1}, {"id": "id2"}](

gluepyspark errors on local development

errors

[WARNING] Could not transfer metadata net.minidev:json-smart/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory
[WARNING] Could not transfer metadata commons-codec:commons-codec/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory
19/09/02 20:34:14 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.

versions and tracebacks

Using master and the bare instructions from

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"

$ ls -1 /opt/
apache-maven-3.6.0
spark-2.2.1-bin-hadoop2.7

$ echo $SPARK_HOME
/opt/spark-2.2.1-bin-hadoop2.7

$ which mvn
/opt/apache-maven-3.6.0/bin/mvn
$ mvn --version
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T11:41:47-07:00)
Maven home: /opt/apache-maven-3.6.0
Java version: 1.8.0_201, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-58-generic", arch: "amd64", family: "unix"

$ which spark-shell
/opt/spark-2.2.1-bin-hadoop2.7/bin/spark-shell

$ git remote -v
origin	git@github.com:awslabs/aws-glue-libs.git (fetch)
origin	git@github.com:awslabs/aws-glue-libs.git (push)

$ git ll
* 968179f - (HEAD -> master, origin/master, origin/HEAD) Use AWSGlueETL jars to run the glue python shell/submit locally (5 days ago) <Vinay Kumar Vavili>
* 19c4d84 - Update year to 2019. (7 months ago) <Ben Sowell>
* 7e76cc9 - Update AWS Glue ETL Library to latest version (01/2019). (7 months ago) <Ben Sowell>
* 21ff9e2 - Adding standard files (1 year, 2 months ago) <Henri Yandell>


$ ./bin/gluepyspark 

...

[WARNING] Could not transfer metadata net.minidev:json-smart/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory
[WARNING] Could not transfer metadata commons-codec:commons-codec/maven-metadata.xml from/to aws-glue-etl-artifacts-snapshot (s3://aws-glue-etl-artifacts-beta/snapshot): Cannot access s3://aws-glue-etl-artifacts-beta/snapshot with type default using the available connector factories: BasicRepositoryConnectorFactory

...

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.911 s
[INFO] Finished at: 2019-09-02T20:34:12-07:00
[INFO] ------------------------------------------------------------------------
mkdir: cannot create directory ‘/home/joe/src/jupiter/jupiter-glue/aws-glue-libs/conf’: File exists
Python 3.6.7 | packaged by conda-forge | (default, Jul  2 2019, 02:18:42) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dlweber/src/jupiter/jupiter-glue/aws-glue-libs/jars/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark-2.2.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/09/02 20:34:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/09/02 20:34:14 WARN Utils: Your hostname, weber-jupiter resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
19/09/02 20:34:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/09/02 20:34:14 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:236)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:214)
java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/shell.py", line 45, in <module>
    spark = SparkSession.builder\
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 334, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 180, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 273, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoSuchMethodError: io.netty.util.ResourceLeakDetector.addExclusions(Ljava/lang/Class;[Ljava/lang/String;)V
	at io.netty.buffer.AbstractByteBufAllocator.<clinit>(AbstractByteBufAllocator.java:34)
	at org.apache.spark.network.util.NettyUtils.createPooledByteBufAllocator(NettyUtils.java:112)
	at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:107)
	at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
	at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:70)
	at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:453)
	at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:56)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:246)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:236)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/shell.py", line 54, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 334, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 180, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/pyspark/context.py", line 273, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoClassDefFoundError: Could not initialize class io.netty.buffer.PooledByteBufAllocator
	at org.apache.spark.network.util.NettyUtils.createPooledByteBufAllocator(NettyUtils.java:112)
	at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:107)
	at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
	at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:70)
	at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:453)
	at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:56)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:246)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:236)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

>>> 

gluepyspark using Python 2.7 instead of 3.7

Hello,

I'm trying to get Glue local development working. One problem I'm having is that every time I run bin/gluepyspark it starts under Python 2.7 instead of 3.7.

If I do which python it shows /usr/bin/python, which is 2.7. However, if I do python --version I get 3.7.7, because I edited my PATH in .bash_profile. Also, for PySpark I edited spark-env.sh and added PYSPARK_PYTHON='/usr/local/bin/python3'. If I run the command pyspark it runs under 3.7.7. The odd part is that gluepyspark appears to call $SPARK_HOME/bin/pyspark, so I'm not sure why it's running with a different Python version.

How do I tell gluepyspark to use Python 3?

Thanks,
Dave

Error while running ETL script

I am getting the following error when I try to run the ETL script, despite adding the required library, PyGlue.zip, to my PYTHONPATH. Could you tell me how to resolve this?

Traceback (most recent call last):
  File "aws-glue-clone-db-to-s3.py", line 30, in <module>
    gc = GlueContext(sc)
  File "/home/jupyter/notebooks/etl/libs/bmrn-glue-libs/build/libs/PyGlue.zip/awsglue/context.py", line 44, in __init__
  File "/home/jupyter/notebooks/etl/libs/bmrn-glue-libs/build/libs/PyGlue.zip/awsglue/context.py", line 64, in _get_glue_scala_context
TypeError: 'JavaPackage' object is not callable
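
In general, a TypeError: 'JavaPackage' object is not callable from GlueContext means the JVM could not find the Glue ETL classes, so PyGlue.zip alone is not enough; the Glue ETL jars also need to be on the driver classpath (which is what ./bin/gluepyspark and ./bin/gluesparksubmit normally set up). A minimal sketch, assuming the jars were downloaded into a jarsv1 directory under the aws-glue-libs checkout (the path is an assumption):

    # Sketch: expose the Glue ETL jars to the driver JVM before creating GlueContext.
    # The jar directory is an assumption; adjust it to where the Maven step placed the jars.
    from pyspark.sql import SparkSession
    from awsglue.context import GlueContext

    spark = (
        SparkSession.builder
        .appName("glue-local")
        .config("spark.driver.extraClassPath", "/path/to/aws-glue-libs/jarsv1/*")
        .getOrCreate()
    )
    glue_context = GlueContext(spark.sparkContext)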

Py4JJavaError: An error occurred while calling o32.getDynamicFrame

Hi, while creating a dynamic frame, the error below occurs:

    dyf = context.create_dynamic_frame_from_options(
        "s3",
        {
            'paths': ['s3://{}'.format(bucket_name)],
            'recurse': True,
            'groupFiles': 'inPartition',
            'groupSize': '134217728'
        },
        format='json'
    )

Error Logs:
Traceback (most recent call last):
  File "", line 11, in <module>
  File "/Users/prajiv/trivago/softwares/aws-glue-libs-master/PyGlue.zip/awsglue/context.py", line 155, in create_dynamic_frame_from_options
  File "/Users/prajiv/trivago/softwares/aws-glue-libs-master/PyGlue.zip/awsglue/data_source.py", line 36, in getFrame
  File "/Users/prajiv/trivago/softwares/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/Users/prajiv/trivago/softwares/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/Users/prajiv/trivago/softwares/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.getDynamicFrame.
: java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:443)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:426)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:426)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:257)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:256)
at scala.collection.immutable.List.foreach(List.scala:383)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:256)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)
at com.amazonaws.services.glue.DynamicFrame.<init>(DynamicFrame.scala:125)
at com.amazonaws.services.glue.HadoopDataSource$$anonfun$getDynamicFrame$5.apply(DataSource.scala:487)
at com.amazonaws.services.glue.HadoopDataSource$$anonfun$getDynamicFrame$5.apply(DataSource.scala:465)
at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:63)
at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:63)
at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:57)
at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWithQualifiedScheme(FileSchemeWrapper.scala:63)
at com.amazonaws.services.glue.HadoopDataSource.getDynamicFrame(DataSource.scala:464)
at com.amazonaws.services.glue.DataSource$class.getDynamicFrame(DataSource.scala:73)
at com.amazonaws.services.glue.HadoopDataSource.getDynamicFrame(DataSource.scala:202)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.base/java.lang.Thread.run(Thread.java:835)

Support OR for SplitRows

Right now, it is not possible to have multiple split criteria for the same operator, e.g. col = 1 OR col = 2. We use multiple splits and unions in this case, which is not optimal.

Running gluepyspark results in a Py4JJavaError while calling JavaSparkContext

I followed the steps in the README.md to use awsglue locally; here is the Dockerfile I built:

FROM python:3.6-buster

WORKDIR /src

# INSTALL JAVA
RUN echo "deb http://ftp.us.debian.org/debian sid main" >> /etc/apt/sources.list && \
    apt-get update && \
    apt-get install -y git openjdk-8-jdk zip && \
    rm -rf /var/cache/apt/*

# INSTALL MAVEN as EXPECTED by GLUE
RUN wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
RUN tar zxvf apache-maven-3.6.0-bin.tar.gz
ENV PATH=/src/apache-maven-3.6.0/bin:$PATH
RUN rm apache-maven-3.6.0-bin.tar.gz

COPY jobs /src/jobs
COPY spark_x /src/spark_x

# COPY AWSGLUE IN VENV
RUN git clone https://github.com/awslabs/aws-glue-libs.git /src/aws-glue-libs
WORKDIR /src/aws-glue-libs
RUN git checkout glue-1.0
ENV PATH=/src/aws-glue-libs/bin:$PATH

# Install GLUE SPARK
RUN wget https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
RUN tar -xvf spark-2.4.3-bin-hadoop2.8.tgz
RUN rm spark-2.4.3-bin-hadoop2.8.tgz
ENV SPARK_HOME="/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8"

Then calling in the container:

gluepyspark

or:

gluesparksubmit --master local[8] --deploy-mode client job.py [...]

Results in the following error:

[...]
java.lang.Thread.run(Thread.java:748)
/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/shell.py", line 41, in <module>
    spark = SparkSession._create_shell_session()
  File "/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/sql/session.py", line 583, in _create_shell_session
    return SparkSession.builder.getOrCreate()
  File "/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/context.py", line 367, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/context.py", line 136, in __init__
    conf, jsc, profiler_cls)
  File "/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/context.py", line 198, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/context.py", line 306, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/src/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.AbstractMethodError: io.netty.util.concurrent.MultithreadEventExecutorGroup.newChild(Ljava/util/concurrent/Executor;[Ljava/lang/Object;)Lio/netty/util/concurrent/EventExecutor;
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:84)
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:58)
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:47)
	at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
	at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
	at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
	at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:50)
	at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:102)
	at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
	at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:71)
	at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:461)
	at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:57)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:249)
	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Automatic Date Parsing Incorrect

When creating a dynamic frame from a glue data catalog with all string fields, the DynamicFrame is assigning a type of date to one of the fields. The field does contain timestamps formatted as 05/May/2017:12:24:13 -0400 aka dd/MMM/yyyy:HH:mm:ss Z, but the DynamicFrame's parsing as date chops off the time fields.

Is there a way to override this behavior, either by passing a custom date parser or by leaving the schema as specified in the catalog?

If the field remains a string, I could parse it manually by converting to and from a DataFrame.

I also posted a question to the forums (https://forums.aws.amazon.com/thread.jspa?messageID=802754 ). I'm new to pyspark, so maybe there's another way around this behavior?
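
One possible workaround, if the catalog schema keeps the field as a string, is to parse the timestamp explicitly after converting to a DataFrame. A minimal sketch, assuming an existing glueContext and a DynamicFrame named dyf with a string column request_time (the frame and column names are placeholders):

    # Sketch: parse "05/May/2017:12:24:13 -0400" explicitly instead of relying on
    # automatic date inference. Frame and column names are placeholders.
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql.functions import to_timestamp

    df = dyf.toDF()
    df = df.withColumn("request_ts", to_timestamp(df["request_time"], "dd/MMM/yyyy:HH:mm:ss Z"))
    dyf_parsed = DynamicFrame.fromDF(df, glueContext, "dyf_parsed")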

No module named awsglue.utils

Hi Team,
I am executing a simple pyspark program with standard imports, as shown below.

Imports:
import sys
import logging
import csv
import boto3, botocore
import time
import requests
#from awsglue.transforms import *
#from awsglue.utils import getResolvedOptions
#from pyspark.context import SparkContext
#from pyspark.sql.functions import *
#from awsglue.context import GlueContext
#from awsglue.dynamicframe import DynamicFrame
#from awsglue.job import Job
#from requests import api
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark import SparkContext, SparkConf

Error
20/02/16 22:56:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/02/16 22:56:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/home/UX45556/aws-glue/bin/gluetera.py", line 28, in <module>
    from awsglue.utils import getResolvedOptions
ImportError: No module named awsglue.utils
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

I am using spark-submit command to execute this script.
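
For what it's worth, awsglue can only be imported if PyGlue.zip is on the Python path, for example via spark-submit's --py-files option or the PYTHONPATH variable. A minimal sketch of doing it from inside the script, with the zip location being an assumption about the local checkout:

    # Sketch: put PyGlue.zip on sys.path before importing awsglue.
    # The path is an assumption; point it at your aws-glue-libs checkout.
    import sys
    sys.path.insert(0, "/path/to/aws-glue-libs/PyGlue.zip")

    from awsglue.utils import getResolvedOptions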

How to use a connection from connection_options during local development

Is it possible to use connections locally?

I mean, something like this:
    write_dynamic_frame.from_jdbc_conf(
        frame=dynamic_frame,
        catalog_connection='my_jdbc_connection',
        connection_options={'dbtable': table_name, 'database': 'dbo'},
        redshift_tmp_dir='s3://mybucket/test/users/trial_out/',
        transformation_ctx='ctx'
    )
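
A catalog_connection has to be resolved through the Glue Data Catalog, which is usually what fails locally. A commonly used alternative, sketched below with placeholder connection type, URL, table, and credentials, is to pass the JDBC settings directly through connection_options:

    # Sketch: write via explicit JDBC options instead of a catalog connection.
    # The connection type, URL, table, and credentials are placeholders.
    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="postgresql",
        connection_options={
            "url": "jdbc:postgresql://my-host:5432/mydb",
            "dbtable": "dbo.users",
            "user": "my_user",
            "password": "my_password",
        },
    )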

Job parameters and libraries

Can job parameters be passed locally and then accessed via the getResolvedOptions method? The same question applies to all of the parameters in the "Security configuration, script libraries, and job parameters (optional)" section.
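
For local runs, parameters passed on the command line as --<NAME> <value> can typically be read with getResolvedOptions; a minimal sketch (the parameter names are placeholders):

    # Sketch: read job parameters passed as, for example,
    #   ./bin/gluesparksubmit job.py --JOB_NAME local-test --source_bucket my-bucket
    import sys
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_bucket"])
    print(args["JOB_NAME"], args["source_bucket"])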

When creating dataframe from data catalog: The security token included in the request is invalid

When I try to create a dynamic frame based on a Data Catalog table, I get the error below:

com.amazonaws.services.glue.model.AWSGlueException: The security token included in the request is invalid. (Service: AWSGlue; Status Code: 400; Error Code: UnrecognizedClientException;

This happens when I run my job locally and it tries to access the Data Catalog in the AWS environment.
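
The error comes from the Glue Data Catalog client rather than from Spark itself, so it is worth confirming which credentials the local environment actually resolves before starting the job. A small diagnostic sketch using boto3 (boto3 and the profile name are assumptions, not part of awsglue):

    # Sketch: check which AWS identity the local credential chain resolves to.
    import boto3

    session = boto3.Session(profile_name="my-profile")  # or omit to use the default chain
    print(session.client("sts").get_caller_identity())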

Glue API for Scala

I'm trying to write a Glue script outside the AWS script editor, but the dependencies for the Scala API are missing.
Where can I find the dependency providing classes like com.amazonaws.services.glue.GlueContext, com.amazonaws.services.glue.MappingSpec and com.amazonaws.services.glue.ResolveSpec?

PS 1: Looks like issue #1517 is related
PS 2: For python I can find the library under https://github.com/awslabs/aws-glue-libs
PS 3: The issue I opened on aws-java-sdk was closed and I was told to open it here :)
aws/aws-sdk-java#1671

Python 3

Any news about the Python 3 version? There's less than a year left before support for Python 2.7 is dropped.

unbox works only on simple json

unbox seems to only extract fields in the first level of the json document, not deeper.

e.g.

    {
        "first": "firstdata",
        "second": {"key": "value"}
    }

After unbox we have the fields:

    first -> firstdata
    second -> null

Overriding Cert Checking

I'm getting this error:
19/10/04 12:51:04 INFO AmazonHttpClient: Configuring Proxy. Proxy Host: Proxy Port: 8080 Exception type: <class 'py4j.protocol.Py4JJavaError'> Exception: An error occurred while calling o27.getCatalogSource. : com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1175) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1121) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686) at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512) at com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:6298) at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:6265) at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:6254) at com.amazonaws.services.glue.AWSGlueClient.executeGetTable(AWSGlueClient.java:3983) at com.amazonaws.services.glue.AWSGlueClient.getTable(AWSGlueClient.java:3955) at com.amazonaws.services.glue.util.DataCatalogWrapper$$anonfun$1.apply(DataCatalogWrapper.scala:82) at com.amazonaws.services.glue.util.DataCatalogWrapper$$anonfun$1.apply(DataCatalogWrapper.scala:78) at scala.util.Try$.apply(Try.scala:191) at com.amazonaws.services.glue.util.DataCatalogWrapper.getTable(DataCatalogWrapper.scala:78) at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:150) at com.amazonaws.services.glue.GlueContext.getCatalogSource(GlueContext.scala:139) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639) at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223) at 
sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037) at sun.security.ssl.Handshaker.process_record(Handshaker.java:965) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379) at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396) at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.upgrade(DefaultHttpClientConnectionOperator.java:193) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.upgrade(PoolingHttpClientConnectionManager.java:389) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) at com.amazonaws.http.conn.$Proxy14.upgrade(Unknown Source) at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:429) at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1297) at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113) ... 29 more Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:397) at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:302) at sun.security.validator.Validator.validate(Validator.java:262) at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:330) at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:237) at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:132) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621) ... 54 more Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141) at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126) at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280) at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:392) ... 60 more

I want to disable cert checking for the time being (I am fully aware of the consequences). In AWS SDK applications, I could programmatically override the TrustManager. I am not sure how to do that in this case.

I've tried:

./gluesparksubmit ./job.py --JOB_NAME test1 --driver-java-options "-Dcom.sun.net.ssl.checkRevocation=false" --conf 'spark.executor.extraJavaOptions=-Dcom.sun.net.ssl.checkRevocation=false' --conf 'spark.driver.extraJavaOptions=-Dcom.sun.net.ssl.checkRevocation=false'

Underlying sun.java.command command (obtained from spark UI):
org.apache.spark.deploy.SparkSubmit --py-files /Users/USER/Documents/aws/EV-IV/glue_etl/Glue/aws-glue-libs/PyGlue.zip ../../../Glue Jobs/sbs_ds_incremental.py --JOB_NAME test1 --driver-java-options -Dcom.sun.net.ssl.checkRevocation=false --conf spark.executor.extraJavaOptions=-Dcom.sun.net.ssl.checkRevocation=false --conf spark.driver.extraJavaOptions=-Dcom.sun.net.ssl.checkRevocation=false

This does not alleviate the error. Is there any way to disable cert checking? Perhaps the command is wrong?

Starting a job with a trigger causes error

I created a job that uses two --extra-py-files; one of them is a library archived in a zip file, following the AWS guidelines.
When the job is started through the AWS Glue console, everything works fine. Whenever I use a trigger or the command line (start-job-run) to start the exact same job, I get the following error:

Resource Setup Error: Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR s3://bucket/path/to/my/zip/file.zip with URI s3. Please specify a class through --class.

I have tried using non-overridable parameters and specifying the extra-py-files in my AWS CLI command and in the trigger arguments; nothing seems to work.

Publish awsglue to PyPi

It would be much easier to install the awsglue Python code for linting and code completion if you published the package to PyPI so we could install it with pip.

Error in handling --<argument> vs <argument> in getResolvedOptions

Test Case:

from awsglue.utils import getResolvedOptions

getResolvedOptions(["--JOB_RUN_ID", "10", "--random", "7"], ["JOB_RUN_ID", "random"])

Expectation:

Traceback (most recent call last):
  ...
    getResolvedOptions(["--JOB_RUN_ID", "10", "--random", "7"], ["--JOB_RUN_ID", "random"])
  File "/Users/shuttl/Downloads/PyGlue.zip/awsglue/utils.py", line 77, in getResolvedOptions
RuntimeError: Using reserved arguments --JOB_RUN_ID

Actual:

Traceback (most recent call last):
  ...
    getResolvedOptions(["--JOB_RUN_ID", "10", "--random", "7"], ["JOB_RUN_ID", "random"])
  File "/Users/shuttl/Downloads/PyGlue.zip/awsglue/utils.py", line 92, in getResolvedOptions
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/argparse.py", line 1308, in add_argument
    return self._add_action(action)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/argparse.py", line 1682, in _add_action
    self._optionals._add_action(action)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/argparse.py", line 1509, in _add_action
    action = super(_ArgumentGroup, self)._add_action(action)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/argparse.py", line 1322, in _add_action
    self._check_conflict(action)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/argparse.py", line 1460, in _check_conflict
    conflict_handler(action, confl_optionals)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/argparse.py", line 1467, in _handle_conflict_error
    raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --JOB_RUN_ID: conflicting option string(s): --JOB_RUN_ID

Test Case:

from awsglue.utils import getResolvedOptions

print getResolvedOptions(["--JOB_RUN_ID", "10", "--random", "7", "----other", "5"], ["--other", "random"])['--other']

Expectation:

5

Actual:

Traceback (most recent call last):
  ...
    print getResolvedOptions(["--JOB_RUN_ID", "10", "--random", "7", "----other", "5"], ["--other", "random"])['--other']
KeyError: '--other'

Test Case:

from awsglue.utils import getResolvedOptions

print getResolvedOptions(["--JOB_RUN_ID", "10", "--random", "7", "----JOB_RUN_ID", "5"], ["--JOB_RUN_ID", "random"])["--JOB_RUN_ID"]

Expectation:

5

Actual:

Traceback (most recent call last):
  ...
    print getResolvedOptions(["--JOB_RUN_ID", "dkfjhbvfkdi_dfhjvb_13e_dfhk", "--random", "gtdgk"], ["--JOB_RUN_ID", "random"])
  File "/Users/shuttl/Downloads/PyGlue.zip/awsglue/utils.py", line 77, in getResolvedOptions
RuntimeError: Using reserved arguments --JOB_RUN_ID

pyspark.sql.utils.IllegalArgumentException: u"Bucket name should not contain '_'"

I am not sure at what level this problem occurs, but I had thought that '_' were allowed in bucket names?

    glueContext.write_dynamic_frame.from_options(
        frame = expt_coll["experiment"],
        connection_type = "s3",
        connection_options = {"path": "s3://omics_metatata/temp_expt/"},
        format = "parquet"
    )

Results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/tmp/spark-53bfe016-e87e-4fb9-90d8-bea9225887e0/userFiles-3c08fed2-db6b-495c-a4d3-918277ee71a7/PyGlue.zip/awsglue/dynamicframe.py", line 551, in from_options
  File "/mnt/tmp/spark-53bfe016-e87e-4fb9-90d8-bea9225887e0/userFiles-3c08fed2-db6b-495c-a4d3-918277ee71a7/PyGlue.zip/awsglue/context.py", line 176, in write_dynamic_frame_from_options
  File "/mnt/tmp/spark-53bfe016-e87e-4fb9-90d8-bea9225887e0/userFiles-3c08fed2-db6b-495c-a4d3-918277ee71a7/PyGlue.zip/awsglue/context.py", line 199, in write_from_options
  File "/mnt/tmp/spark-53bfe016-e87e-4fb9-90d8-bea9225887e0/userFiles-3c08fed2-db6b-495c-a4d3-918277ee71a7/PyGlue.zip/awsglue/data_sink.py", line 32, in write
  File "/mnt/tmp/spark-53bfe016-e87e-4fb9-90d8-bea9225887e0/userFiles-3c08fed2-db6b-495c-a4d3-918277ee71a7/PyGlue.zip/awsglue/data_sink.py", line 28, in writeFrame
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Bucket name should not contain '_'"

SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main] java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST

I downloaded the latest linked distribution spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8 and checked out this repo just now:

$> bin/gluepyspark

[...maven builds...]

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  1.440 s
[INFO] Finished at: 2020-01-23T15:13:09+01:00
[INFO] ------------------------------------------------------------------------
mkdir: /Users/x/awslabs/aws-glue-libs/conf: File exists
/Users/x/awslabs/aws-glue-libs
Python 3.7.6 (default, Dec 30 2019, 19:38:26)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/x/awslabs/aws-glue-libs/jars/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/x/lib/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/01/23 15:13:10 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:59)
	at scala.collection.MapLike$class.apply(MapLike.scala:141)
	at scala.collection.AbstractMap.apply(Map.scala:59)
	at org.apache.spark.api.python.PythonGatewayServer$$anonfun$main$1.apply$mcV$sp(PythonGatewayServer.scala:50)
	at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1262)
	at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:37)
	at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "/Users/x/lib/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8//python/pyspark/shell.py", line 38, in <module>
    SparkContext._ensure_initialized()
  File "/Users/x/lib/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/x/lib/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/Users/x/lib/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/python/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

Any idea what is missing here?
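
This failure is commonly caused by a version mismatch between the pyspark package that Python actually imports and the Spark distribution under SPARK_HOME (for example a pip-installed pyspark shadowing the bundled one). A quick diagnostic sketch:

    # Sketch: confirm which pyspark is imported and that it matches SPARK_HOME.
    import os
    import pyspark

    print(pyspark.__version__)
    print(pyspark.__file__)
    print(os.environ.get("SPARK_HOME"))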

Broken Maven link

The link to Maven in README.md returns a 404:

Install apache maven from the following location: http://apache.mirrors.tds.net/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz

The same URL is used in the official docs.

$ wget http://apache.mirrors.tds.net/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz
--2019-09-03 13:20:58--  http://apache.mirrors.tds.net/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz
Resolving apache.mirrors.tds.net (apache.mirrors.tds.net)... 216.165.129.134
Connecting to apache.mirrors.tds.net (apache.mirrors.tds.net)|216.165.129.134|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2019-09-03 13:20:59 ERROR 404: Not Found.

Local glue fails due to collision on javax.servlet-3.0.0.v201112011016.jar

We are using aws-glue-libs locally to allow us to run tests in CodeBuild as part of our CI/CD pipeline. After an unrelated change was pushed yesterday, we found the job failing due to this error:

20/03/18 13:49:10 ERROR SparkContext: Error initializing SparkContext.
java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package
	at java.lang.ClassLoader.checkCerts(ClassLoader.java:898)
	at java.lang.ClassLoader.preDefineClass(ClassLoader.java:668)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:761)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.spark_project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:154)
	at org.spark_project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:146)
	at org.spark_project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:140)
	at org.spark_project.jetty.servlet.ServletContextHandler.<init>(ServletContextHandler.java:110)
	at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:143)
	at org.apache.spark.ui.JettyUtils$.createServletHandler(JettyUtils.scala:130)
	at org.apache.spark.ui.WebUI.attachPage(WebUI.scala:83)
	at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:65)
	at org.apache.spark.ui.WebUI$$anonfun$attachTab$1.apply(WebUI.scala:65)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.ui.WebUI.attachTab(WebUI.scala:65)
	at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:62)
	at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:80)
	at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:175)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:444)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

After some digging online around this error, and looking at the jar files downloaded, I was able to resolve this issue by adding a command to our buildspec that removes the jar file javax.servlet-3.0.0.v201112011016.jar from the jarsv1 dir.

This had not happened on previous runs, and we did not see any changes in the glue repo or the pom file for GlueETL. This also did not happen when we used CodeBuild local, or when we ran glue lib locally on our laptops.

Integration with Jupyter/Notebooks

It seems there is no really easy way to hook up the local setup for use in a Jupyter notebook. I think I have made every possible attempt to get this to work. Is this intentional, or is there a way to do it? It would be fantastic if local testing in a notebook environment were possible.
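
One approach that can work is to wire up SPARK_HOME, Spark's Python libraries, and PyGlue.zip inside the notebook before creating the GlueContext. This is only a sketch; every path below is an assumption about the local layout, and the Glue ETL jars still need to be on the driver classpath (for example via spark.driver.extraClassPath, or by launching Jupyter from an environment prepared the same way as ./bin/gluepyspark):

    # Sketch: make Spark and the Glue Python library importable from a notebook cell.
    # All paths are assumptions; adjust them to your installation.
    import os
    import sys

    os.environ["SPARK_HOME"] = "/opt/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8"
    sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
    sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python/lib/py4j-0.10.7-src.zip"))
    sys.path.insert(0, "/path/to/aws-glue-libs/PyGlue.zip")

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())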

Union transform

It would be really nice to have a union transform on DynamicFrames. Right now it is necessary to use DataFrames for this operation. With a dedicated union transform, visualization of jobs would also improve; right now we have to use a join for this.
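
Until a native transform exists, the usual workaround is a round trip through DataFrames; a minimal sketch, assuming an existing glueContext and two DynamicFrames with placeholder names:

    # Sketch: union two DynamicFrames by converting to DataFrames and back.
    from awsglue.dynamicframe import DynamicFrame

    unioned = DynamicFrame.fromDF(
        dyf_a.toDF().union(dyf_b.toDF()),
        glueContext,
        "unioned",
    )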

How to run scala shell

Hi,
I followed the instructions to set up the Glue development environment locally. I see that there is gluepyspark in the bin directory, which allows me to launch a pyspark shell with the Glue libs.

Is there a way to launch the Scala spark-shell in the same way, with the Glue libraries on the path?

thanks
-Arul

decimal type losing precision after select_fields transformation

I have an RDS table with a decimal(10,4) field. This table is used as the source for ETL, but the select_fields transformation drops the precision to decimal(5,4) in the following flow:

...
>>> datasource = glueContext.create_dynamic_frame.from_catalog(database = "<rds-source>", table_name = "<table>")
>>> datasource.schema
StructType([..., Field(Bounty, DecimalType(10, 4, {}), {}), ...], {})

>>> mapping = datasource.apply_mapping([..., ("bounty", "decimal(10,4)", "bounty", "decimal(10,4)"), ...])
>>> mapping.schema
StructType([..., Field(bounty, DecimalType(10, 4, {}), {}), ...], {})

>>> selected = mapping.select_fields([..., "bounty", ...])
>>> selected.schema
StructType([..., Field(bounty, DecimalType(5, 4, {}), {}), ...], {})
...
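
As a workaround until select_fields preserves the declared precision, the column can be cast back explicitly after the transformation. A sketch via DataFrames, reusing the names from the flow above and assuming glueContext is in scope:

    # Sketch: restore the intended decimal(10,4) precision after select_fields.
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql.functions import col

    df = selected.toDF().withColumn("bounty", col("bounty").cast("decimal(10,4)"))
    fixed = DynamicFrame.fromDF(df, glueContext, "fixed")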

DynamicFrame.show(self, num_rows = 20) is not printing anything in Zeppelin output

I am using a Zeppelin notebook connected to an AWS Glue development endpoint. I am reading data from Redshift using GlueContext.create_dynamic_frame.from_catalog.

When I call show on that DynamicFrame, it doesn't print anything in the Zeppelin output.

Calling printSchema shows the schema, and calling count shows 96 records.
Also, converting the DynamicFrame to a DataFrame and then calling show successfully displays 20 records from the data.
