py-dataproc

A Python wrapper around the Google Dataproc client.

Essentially, the Google Python client for Dataproc uses the generic googleapiclient interface, and is somewhat clunky to use. Until it's ported to the new google.cloud tooling, this class should make working with DataProc from Python a lot easier.

Setup

Clone the repo, create a virtualenv, install the requirements, and you're good to go!

git clone https://github.com/oli-hall/py-dataproc.git
cd py-dataproc
virtualenv bin/env
source bin/env/bin/activate
python setup.py install

Thanks to setuptools, you should also be able to install directly from GitHub by adding the following to your requirements file:

git+git://github.com/oli-hall/py-dataproc.git#egg=pydataproc

Requirements

This has been tested with Python 2.7, but should be Python 3 compatible. More thorough testing will follow.

Usage

N.B. You will need to authenticate your shell with Google Cloud before connecting. To do this (assuming you have the gcloud CLI tools installed), run the following:

gcloud auth login

This will pop up an authentication dialogue in your browser of choice, and once you accept, it will authorise your shell with Google Cloud.

To initialise a client, provide the DataProc class with your Google Cloud Platform project ID, and optionally the region/zone you wish to connect to (defaults to europe-west1).

> dataproc = DataProc("gcp-project-id", region="europe-west1", zone="europe-west1-b")
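
For completeness, a minimal import sketch (the exact module path is an assumption; adjust it to however the package exposes DataProc):

> from pydataproc import DataProc
> dataproc = DataProc("gcp-project-id")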

Versions

The API has been updated significantly as of version 0.7.0. The documentation below includes the newer API first, but still gives a brief overview of the previous API for completeness.

Latest API

Working with clusters

Clusters are accessed through a single interface: clusters(). Providing the cluster_name parameter gives you access to a single cluster; calling it with no arguments gives you access to all clusters.

To list all clusters:

> dataproc.clusters().list()

By default, this will return a minimal dictionary of cluster name -> cluster state. There is a boolean minimal flag which, if set to False, will return all the current cluster information from the underlying API, again keyed by cluster name.
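
For example, to retrieve the full cluster details (assuming the flag is passed as a keyword argument):

> dataproc.clusters().list(minimal=False)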

Creating a cluster

Creating a cluster is where the DataProc client really becomes useful. Rather than specifying a giant dictionary, simply call create. This allows configuration of machine types, number of instances, etc., while abstracting away the API details:

> dataproc.clusters().create(
    cluster_name="my-cluster",
    num_workers=4,
    worker_disk_gb=250
)

It can also block until the cluster has been created. This method returns a Cluster object, which we'll dive into next.
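
Since create returns a Cluster object, you can work with it directly, for example (a short sketch, assuming the Cluster methods described below are available on the returned object):

> cluster = dataproc.clusters().create(
    cluster_name="my-cluster",
    num_workers=4
)
> cluster.status()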

Individual Clusters

Individual clusters can be queried and updated through the Cluster class. You can create one either by calling clusters().create() or by calling clusters() with a cluster name parameter:

> dataproc.clusters("my-cluster")

N.B. If you call dataproc.clusters() with a non-existent cluster name, an Exception will be raised.
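
If you want to handle the missing-cluster case yourself, you can catch the exception (a rough sketch; only a generic Exception is guaranteed):

> try:
    cluster = dataproc.clusters("no-such-cluster")
except Exception as e:
    print("cluster not found: {}".format(e))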

To fetch cluster information for a particular cluster, fetch the cluster, then call info:

> dataproc.clusters("my-cluster").info()

And similarly to fetch the status:

> dataproc.clusters("my-cluster").status()

This returns just the status of the requested cluster as a string. If you're only interested in knowing whether it's running...

> dataproc.clusters("my-cluster").is_running()

Deleting a cluster:

> dataproc.clusters("my-cluster").delete()

Working with jobs

Submitting a job can be as simple or as complex as you want. You can simply specify a cluster, the Google Storage location of a pySpark job, and job args...

> dataproc.clusters("my-cluster").submit_job("gs://my_bucket/jobs/my_spark_job.py", args="-arg1 some_value -arg2 something_else")

...or you can submit the full job details dictionary yourself (see the DataProc API docs for more information).
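
As a rough sketch, a pySpark job dictionary might look like the following, based on the Dataproc REST API's job resource (the exact keys the wrapper expects, and whether it fills in the placement section for you, are assumptions):

> job = {
    "placement": {"clusterName": "my-cluster"},
    "pysparkJob": {
        "mainPythonFileUri": "gs://my_bucket/jobs/my_spark_job.py",
        "args": ["-arg1", "some_value", "-arg2", "something_else"]
    }
}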

Previous API - versions 0.6.2 and below

Working with existing clusters

There are a few different commands to work with existing clusters. You can list all clusters, get the state of a particular cluster, or return various pieces of information about said cluster.

To list all clusters:

> dataproc.list_clusters()

By default, this will return a minimal dictionary of cluster name -> cluster state. There is a boolean minimal flag which, if set to False, will return all the current cluster information from the underlying API, again keyed by cluster name.
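
For example, to retrieve the full cluster details with the older API (again assuming the flag is a keyword argument):

> dataproc.list_clusters(minimal=False)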

To fetch cluster information for a particular cluster:

> dataproc.cluster_info("my-cluster")

And similarly to fetch the state:

> dataproc.cluster_state("my-cluster")

This returns just the state of the requested cluster as a string. If you're only interested in knowing whether it's running...

> dataproc.is_running("my-cluster")

Creating a cluster

Creating a cluster is where the DataProc client really becomes useful. Rather than specifying a giant dictionary, simply call create_cluster. This allows configuration of machine types, number of instances, etc., while abstracting away the API details:

> dataproc.create_cluster(
    cluster_name="my-cluster",
    num_workers=4,
    worker_disk_gb=250
)

It can also block until the cluster has been created (N.B. currently this will block forever if the cluster fails to initialise correctly, so be careful!).
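
If you'd rather manage the wait yourself, you can poll with the functions described above, for example (a rough sketch; the timeout and sleep interval are arbitrary):

> import time
> deadline = time.time() + 600  # give up after 10 minutes
> while not dataproc.is_running("my-cluster"):
    if time.time() > deadline:
        raise RuntimeError("cluster did not reach a running state in time")
    time.sleep(15)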

Deleting a cluster

> dataproc.delete_cluster("my-cluster")

Working with jobs

Submitting a job can be as simple or as complex as you want. You can simply specify a cluster, the Google Storage location of a pySpark job, and job args...

> dataproc.submit_job("my-cluster", "gs://my_bucket/jobs/my_spark_job.py", args="-arg1 some_value -arg2 something_else")

...or you can submit the full job details dictionary yourself (see the DataProc API docs for more information).
