SageMaker SparkMagic Library

This is a CLI tool that generates the SparkMagic and Kerberos configuration required to connect to an EMR cluster. In particular, it generates the following two files:

  1. SparkMagic config: This file contains the information needed to connect the SparkMagic kernels running on Studio to the Livy application running on EMR. The CLI obtains cluster details, such as the IP address, by describing the EMR cluster.

  2. krb5.conf: If the EMR cluster uses a Kerberos security configuration, this library also generates the krb5.conf needed for user authentication on Studio.

Usage

This CLI tool comes pre-installed on the Studio SparkMagic image. It can be used from any notebook created from that image.

Connecting to a non-Kerberos cluster:

In a notebook cell, execute the following commands:

%local

!sm-sparkmagic connect --cluster-id "j-xxxxxxxxx"

sample output:

Successfully read emr cluster(j-xxxxxxxx) details
SparkMagic config file is written to location /etc/sparkmagic/config.json
Completed setting up configuration files for SparkMagic to connect to EMR cluster j-xxxxxxxx


Please complete following steps to complete the connection
1. Restart kernel to complete your setup. This is required so SparkMagic can pickup generated configuration
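
Before restarting the kernel, it can help to confirm that the generated file points at the expected Livy endpoint. The following is a minimal sketch, not part of this library. It assumes the file at the path shown in the sample output is JSON and follows the upstream SparkMagic layout, in which each `kernel_*_credentials` section carries a `url` field. The helper name `livy_urls` is hypothetical.

```python
import json
from pathlib import Path


def livy_urls(config_path="/etc/sparkmagic/config.json"):
    """Return {section_name: url} for every *_credentials section that has a url.

    Assumption: the config uses the upstream SparkMagic layout
    (for example a "kernel_python_credentials" section with a "url" key).
    """
    config = json.loads(Path(config_path).read_text())
    return {
        name: section["url"]
        for name, section in config.items()
        if name.endswith("_credentials")
        and isinstance(section, dict)
        and "url" in section
    }


if __name__ == "__main__":
    path = Path("/etc/sparkmagic/config.json")
    if path.exists():
        for name, url in livy_urls(path).items():
            print(f"{name}: {url}")
    else:
        print(f"{path} not found; run sm-sparkmagic connect first")
```

If the printed URLs do not match the cluster you expect, re-run the connect command before restarting the kernel.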

Connecting to a Kerberos cluster:

The command is the same as for a non-Kerberos cluster, except that you also pass a user name with --user-name:

!sm-sparkmagic connect --cluster-id "j-xxxxxxxx" --user-name "ec2-user"

sample output:

Please follow below steps to complete the setup:
1. Please open image terminal and run 'kinit ec2-user'(user_name: ec2-user) to get kerberos ticket
2. Restart kernel to complete your setup. This is required so SparkMagic can pickup generated configuration
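
After running the kinit step from the sample output, you may want to check that a ticket is actually present before restarting the kernel. The sketch below is one way to do that under stated assumptions: it assumes a `klist` binary is on the PATH of the image and that it supports the `-s` flag (silent mode that reports ticket validity through the exit status, as MIT Kerberos does). The function name `has_valid_kerberos_ticket` is hypothetical and not part of this library.

```python
import shutil
import subprocess


def has_valid_kerberos_ticket(run=subprocess.run, which=shutil.which):
    """Return True if `klist -s` reports a readable, unexpired credentials cache.

    `run` and `which` are injectable so the check can be exercised without
    Kerberos installed; by default they are the standard-library functions.
    """
    if which("klist") is None:
        # No Kerberos client tools on this image, so there can be no ticket.
        return False
    # Assumption: `klist -s` exits with 0 when a valid ticket exists.
    return run(["klist", "-s"], check=False).returncode == 0


if __name__ == "__main__":
    if has_valid_kerberos_ticket():
        print("Kerberos ticket found")
    else:
        print("No valid Kerberos ticket; run kinit in the image terminal")
```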

Connecting to an EMR cluster in another account

To set up the configuration for an EMR cluster in another account, run the following command:

%local

!sm-sparkmagic connect --cluster-id "j-xxxxx" --role-arn "arn:aws:iam::222222222222:role/role-on-emr-cluster-account"
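
If the cross-account command fails, a quick preflight can show whether the role is assumable and whether it can describe the cluster. The sketch below is an assumption about how such a check could look and does not claim to mirror this library's internals. It uses the boto3 calls `sts.assume_role` and `emr.describe_cluster`; the function name `describe_cluster_with_role`, the session name, and the injectable client parameters are hypothetical.

```python
def describe_cluster_with_role(cluster_id, role_arn,
                               sts_client=None, emr_client_factory=None):
    """Assume role_arn, then describe cluster_id with the temporary credentials.

    Returns (cluster_name, cluster_state). The two optional parameters let
    the function be exercised with stub clients; when omitted, real boto3
    clients are created using the default region and credentials.
    """
    if sts_client is None or emr_client_factory is None:
        import boto3  # imported lazily so stubs work without boto3 installed

        sts_client = sts_client or boto3.client("sts")
        emr_client_factory = emr_client_factory or (
            lambda **kwargs: boto3.client("emr", **kwargs)
        )

    creds = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="sparkmagic-preflight",  # hypothetical session name
    )["Credentials"]

    emr = emr_client_factory(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    cluster = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]
    return cluster["Name"], cluster["Status"]["State"]
```

In a notebook, calling `describe_cluster_with_role("j-xxxxx", "arn:aws:iam::222222222222:role/role-on-emr-cluster-account")` either returns the cluster name and state or raises the underlying AWS error, which usually indicates whether the trust policy or the EMR permissions are at fault.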

Connecting to an EMR cluster in a private subnet over VPC Endpoints

There is a bug in botocore that requires the user to override the endpoint for EMR clients when connecting over VPC Endpoints. Because this library uses the default boto3 configuration, this may cause issues when connecting to clusters over VPC Endpoints.

As a workaround, run the following code snippet to override the default EMR endpoint in boto3:

%local
import botocore
import json
import os

with open(os.path.join(os.path.dirname(botocore.__file__), 'data', 'endpoints.json'), 'r+') as f:
    data = json.load(f)
    # Use [1] for aws-cn
    data['partitions'][0]['services']['elasticmapreduce']['defaults']['sslCommonName'] = '{service}.{region}.{dnsSuffix}'
    f.seek(0)
    json.dump(data, f)
    f.truncate()

FAQ

  • Can I connect to multiple clusters at the same time?
    • No, you can only connect to one cluster at a time. The tool generates the configuration needed to connect to a single cluster. To connect to a different cluster, re-run the command with that cluster's ID.
  • Can I use this CLI on a non-SparkMagic image on Studio?
    • The CLI only comes pre-installed on the SparkMagic image, but it can be installed on any other image if needed.
  • Can I use this library on SageMaker Notebook instances?
    • It does not come pre-installed on Notebook instances either, but you can install it and try it. You may have to relocate the SparkMagic config file.

Installing

Install the CLI using pip.

pip install sagemaker-studio-sparkmagic-lib

The following extra permissions are required on the role so that it can describe the cluster:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "elasticmapreduce:DescribeSecurityConfiguration",
                "elasticmapreduce:ListInstances"
            ],
            "Resource": "arn:aws:elasticmapreduce:*:*:cluster/*"
        }
    ]
}
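
The policy above allows the three actions on every cluster in every region and account. If you prefer a narrower scope, the sketch below generates the same document with the wildcards in the resource ARN replaced. It keeps the action list and ARN shape exactly as shown above and only substitutes values. The helper name `emr_describe_policy` is hypothetical, the `aws` partition is assumed, and whether each action honours a cluster-scoped resource should be checked against the IAM documentation for EMR before relying on it.

```python
import json

# Actions copied verbatim from the policy above.
ACTIONS = [
    "elasticmapreduce:DescribeCluster",
    "elasticmapreduce:DescribeSecurityConfiguration",
    "elasticmapreduce:ListInstances",
]


def emr_describe_policy(region="*", account_id="*", cluster_id="*"):
    """Build the policy document; the defaults reproduce the wildcard policy."""
    resource = (
        f"arn:aws:elasticmapreduce:{region}:{account_id}:cluster/{cluster_id}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(ACTIONS), "Resource": resource}
        ],
    }


if __name__ == "__main__":
    # Placeholder region, account, and cluster ID; substitute your own.
    policy = emr_describe_policy("us-east-1", "111111111111", "j-xxxxxxxxx")
    print(json.dumps(policy, indent=4))
```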

Development

  • Check out the repository and install it locally:
make install
  • To test locally, start a python3 REPL and run the following Python code:
import sagemaker_studio_sparkmagic_lib.sparkmagic as sm

# Write the generated files to /tmp instead of the default locations
# and skip the kernel restart, so the run does not affect a live setup.
sm.connect_to_emr_cluster(
    cluster_id="j-xxx",
    user_name="ec2-user",
    krb_file_override_path="/tmp/krb5.conf",
    spark_magic_override_path="/tmp/config.json",
    restart_kernel=False,
)

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
