databricks-maven-plugin's People

Contributors

javamonkey79, joongho, kenmy, mikhailkavaliou, minikill, samshuster, timedm

databricks-maven-plugin's Issues

Add support for Azure

I already validated some ops work on Azure Databricks like uploading the jar to dbfs.
Creating clusters doesn't work due to the cluster property aws_attributes.

Have you considered extending support to Azure Databricks as well?

Add information to the README.md on how to build immutable, deployable artifacts

Overview

Currently the README covers the basic functionality used during development or as part of the build cycle, but it doesn't demonstrate how to construct artifacts that can be deployed to Databricks without the source or pom.xml (the NO PROJECT, or NP, mojos).

The goal of this story is to provide examples of how to achieve this, based on how it is done at Edmunds.com.

It would also be good to use this opportunity to provide sample deployment pipelines.

split up BaseWorkspaceMojo

Notes

Please see gitlab note, here.

Recommended approach: PrepareDBResources should be split into 2, one for jobs, one for workspaces

AC

workspace mojo should no longer extend from job mojo

s3 uploads missing permissions

The upload-to-s3 mojo uploads data with no ACL set. As a result, when the upload comes from an account other than the bucket owner's, the bucket owner has no permissions on the artifact.

Ability to utilize proxy maven settings

Hello,

Great plugin you have here! I would like to use it, but unfortunately my organisation operates behind an HTTP proxy which also requires authentication.
I noticed that the databricks-rest-client uses HttpClient, which has to be configured explicitly for proxies to work. Could the plugin use the proxy configuration in the user's .m2/settings.xml, when it is present and the user wants it applied, for calls to the Databricks REST API?

Many thanks,
Jason

Normalize Test names

Currently we have no standards when it comes to test names. I believe we should go for this standard:

{method}_{situation}_{expected outcome}

Like so:
isAdult_AgeLessThan18_False
withdrawMoney_InvalidAccount_ExceptionThrown
admitStudent_MissingMandatoryFields_FailToAdmit
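
The convention could be checked mechanically. Below is a minimal sketch, assuming each of the three parts is alphanumeric and the method part starts with a lowercase letter; TestNameConvention is a hypothetical helper, not something that exists in the plugin.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the plugin: checks test method names against the proposed convention. */
final class TestNameConvention {

    // Three non-empty parts separated by underscores: {method}_{situation}_{expected outcome}.
    // Assumption: each part is alphanumeric and the method part starts with a lowercase letter.
    private static final Pattern PATTERN =
            Pattern.compile("[a-z][A-Za-z0-9]*_[A-Za-z0-9]+_[A-Za-z0-9]+");

    private TestNameConvention() {
    }

    /** Returns true when the whole name matches the three-part pattern. */
    static boolean isValid(String testMethodName) {
        return PATTERN.matcher(testMethodName).matches();
    }

    public static void main(String[] args) {
        String[] names = {
            "isAdult_AgeLessThan18_False",
            "withdrawMoney_InvalidAccount_ExceptionThrown",
            "testWithdraw"
        };
        for (String name : names) {
            System.out.println(name + " -> " + isValid(name));
        }
    }
}
```

A check like this could run over the test classes as part of the build, but whether to enforce the rule automatically or only by review is a separate decision.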

Create example documentation describing sample workflows

Acceptance Criteria:

We will want to comply with this standard: https://maven.apache.org/guides/development/guide-plugin-documentation.html

Two goals are to:

  1. provide documentation per Mojo and command. This should be done in usage.apt
  2. provide documentation for common workflows that utilize multiple commands. This should be done in example pages.

I think the second case is more important because a lot of the goal-specific documentation is auto-generated for us.

Users should be able to look at these example pages to see:

  1. ideas on how they can use profiles to configure most of the boilerplate
  2. ideas on how to have different environments
  3. ideas on what commands should be used as part of a deployment
  4. common development and testing commands

Deploy of a notebook doesn't work from Windows

When I deploy a notebook, the validation checks always fail because the Windows path separator differs from the one used on other operating systems.
The stacktrace:

Expected: [group.name] but found: [\group.name\project-name]
	at com.edmunds.tools.databricks.maven.validation.ValidationUtil.validatePart(ValidationUtil.java:98)
	at com.edmunds.tools.databricks.maven.validation.ValidationUtil.validatePath(ValidationUtil.java:66)
	at com.edmunds.tools.databricks.maven.BaseWorkspaceMojo.validateNotebooks(BaseWorkspaceMojo.java:90)
	at com.edmunds.tools.databricks.maven.PrepareDbResources.prepareNotebooks(PrepareDbResources.java:61)
	at com.edmunds.tools.databricks.maven.PrepareDbResources.execute(PrepareDbResources.java:44)
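
One possible direction is to normalize separators before the path is split into parts for validation. This is only a sketch under that assumption: NotebookPathParts is a hypothetical helper and does not reflect how ValidationUtil is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not the plugin's ValidationUtil: splits a notebook path regardless of OS separator. */
final class NotebookPathParts {

    private NotebookPathParts() {
    }

    /** Converts Windows separators to '/', then returns the non-empty path segments. */
    static List<String> split(String relativePath) {
        String normalized = relativePath.replace('\\', '/');
        List<String> parts = new ArrayList<>();
        for (String part : normalized.split("/")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Both inputs should yield the same segments.
        System.out.println(split("\\group.name\\project-name"));
        System.out.println(split("/group.name/project-name"));
    }
}
```

With the segments extracted this way, the first part of the path from the stack trace would be group.name on every OS, which is what the check expects.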

Deployable Workspace Libraries

Goal

For a library like dwh-databricks-common, it would be easiest for users if it was a library available on workspace. This way users could easily install it to any cluster they want.

This is very different from a productionized interactive cluster or job that needs a predefined set of libraries, which is functionality that already exists. The difference here is that the library would be used on ad hoc clusters that should not be managed programmatically, and we want it to be easy for users to find common code and install the most up-to-date version.

Requirements

There is functionality for maven-plugin to deploy a library as a standalone library on databricks workspace.

A set of default clusters to install it to would be great.

It doesn’t need to restart clusters, I think having it be manual is fine.

Feature request: make restarting running clusters optional after library installation?

We would like to use the plug-in as part of the deploy stage of our build/CI process to install or update a jar as a Databricks library. Since builds can happen at any time (triggered by a git commit), users would see seemingly random cluster restarts and could lose a lot of intermediate data and work, so (understandably, IMHO) the users have rejected the idea.

If the restart could be made optional via a plugin configuration item, this would be great. We already have a notification system to tell all users that a new version has been deployed. Users could then restart whenever it suits them and pick up the latest version after the restart.

I realize I've been requesting a lot over the last 2 days: 2 issues for databricks-maven-plugin and one issue for databricks-rest-client. But I'm really keen to integrate our CI/CD process with Databricks, and these libraries provide 98% of what I need. So I'm happy to help out and submit a pull request for this feature if that would be helpful.

Regards! Dara.

Remove delta mandatory properties

Acceptance Criteria

Please remove spark.databricks.delta.preview.enabled and the delta tag from the whole project as this turned out to be temporary code that shouldn't be forced on users.

Handle alternative scenarios at ClusterMojo

Currently ClusterMojo (start/stop commands) doesn't handle situations such as the target cluster being in the "PENDING" state.

Acceptance Criteria

start/stop cluster command works properly for any cluster state
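
To make the acceptance criteria concrete, here is a rough sketch of one possible decision table. The state names follow the cluster states documented for the Databricks Clusters API; the ClusterCommandPlanner class, the Action values, and the policy itself (for example, failing on ERROR and UNKNOWN) are assumptions for discussion, not the plugin's behaviour.

```java
/** Hypothetical sketch of state handling for start/stop; names are illustrative, not the plugin's API. */
final class ClusterCommandPlanner {

    /** Cluster states as documented for the Databricks Clusters API. */
    enum ClusterState { PENDING, RUNNING, RESTARTING, RESIZING, TERMINATING, TERMINATED, ERROR, UNKNOWN }

    enum Command { START, STOP }

    /** What the mojo would do next (assumed policy). */
    enum Action { CALL_API, WAIT_THEN_RETRY, NOTHING_TO_DO, FAIL }

    private ClusterCommandPlanner() {
    }

    static Action plan(Command command, ClusterState state) {
        // Assumption: ERROR and UNKNOWN are reported to the user instead of being acted on.
        if (state == ClusterState.ERROR || state == ClusterState.UNKNOWN) {
            return Action.FAIL;
        }
        if (command == Command.START) {
            switch (state) {
                case TERMINATED:
                    return Action.CALL_API;        // the only state from which start is issued
                case TERMINATING:
                    return Action.WAIT_THEN_RETRY; // wait for termination to finish, then start
                default:
                    return Action.NOTHING_TO_DO;   // PENDING, RUNNING, RESTARTING, RESIZING
            }
        }
        // Command.STOP
        switch (state) {
            case TERMINATED:
            case TERMINATING:
                return Action.NOTHING_TO_DO;       // already stopped or stopping
            default:
                return Action.CALL_API;            // PENDING, RUNNING, RESTARTING, RESIZING
        }
    }
}
```

Under this table, a start command against a PENDING cluster becomes a no-op rather than an error, which covers the case described above.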

ClusterSettingsDTO refactoring

@JsonProperty("artifact_paths")
private Collection<String> artifactPaths;

Consider moving this property (the libraries to deploy) from ClusterSettingsDTO into UpsertClusterDTO (or even NewClusterDTO), which would make ClusterSettingsDTO redundant.

It requires changes at databricks-rest-client project:
edmunds/databricks-rest-client#42

Since version 1.3.1 Library mojos class not found google.common.util.concurrent.Uninterruptibles

To reproduce:

mvn -X databricks:library -Dlibrary.command=INSTALL -Dclusters=your_cluster
Caused by: java.lang.ClassNotFoundException: com.google.common.util.concurrent.Uninterruptibles
        at org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy.loadClass(SelfFirstStrategy.java:50)
        at org.codehaus.plexus.classworlds.realm.ClassRealm.unsynchronizedLoadClass(ClassRealm.java:271)
        at org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:247)
        at org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:239)
        ... 26 more

You should see the error above.

A fix to this is to add a dependency to the plugin like so:

            <plugin>
                <groupId>com.edmunds</groupId>
                <artifactId>databricks-maven-plugin</artifactId>
                <dependencies>
                    <dependency>
                        <groupId>com.google.guava</groupId>
                        <artifactId>guava</artifactId>
                        <version>16.0.1</version>
                    </dependency>
                </dependencies>
            </plugin>

Ideally this dependency would not need to be declared as above; it was not necessary before 1.3.1.

Maven Plugin should not force a jar to be attached

Overview

I found out that the plugin forces the job description to have the project jar attached as a library.

For this project, it doesn’t make sense for this to be the case.

Acceptance Criteria

  • It is possible to have jobs that don't require a jar

One possible way of doing this could be to be able to override the libraries section in the job settings file. Or perhaps it is a flag?

Properties for each MOJO are clearly specified in documentation

Requirements

  • All properties that can be set for a given mojo are easily visible in documentation.

Notes

There was an attempt to automate this using Maven standards (see the maven-site-plugin that was added); however, the resulting documentation does not appear to be visible when clicking on javadocs.

If you run mvn clean site and then go to for example:
http://localhost:63343/databricks-maven-plugin/target/site/upsert-cluster-np-mojo.html

You can see exactly what is needed.

The problem is this documentation cannot be accessed on hosted javadocs.

add proposed architecture for plugin in readme

Based on a conversation with Shaun, we want to go with one mojo per "service", with the command pattern used inside. Unfortunately, we are currently not consistent. Document this in the README.

add databricks job status mojo

Narrative
As an engineer who uses the databricks maven plugin
I'd like to be able to monitor databricks jobs from the cli

AC
a new mojo that reports the status of a job by name (defaulting to the job name found in the job settings file)

a flag to "follow" the job to termination (perhaps set by default)
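
A rough sketch of what the "follow" behaviour could look like, independent of any REST client. The life cycle states listed are a subset of those documented for the Databricks Jobs API; JobRunFollower, its parameters, and the Supplier used as the status source are hypothetical stand-ins for whatever the mojo would actually call.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of a status/follow loop; not an existing mojo. */
final class JobRunFollower {

    /** A subset of the run life cycle states documented for the Databricks Jobs API. */
    enum LifeCycleState { PENDING, RUNNING, TERMINATING, TERMINATED, SKIPPED, INTERNAL_ERROR }

    private JobRunFollower() {
    }

    static boolean isTerminal(LifeCycleState state) {
        return state == LifeCycleState.TERMINATED
                || state == LifeCycleState.SKIPPED
                || state == LifeCycleState.INTERNAL_ERROR;
    }

    /**
     * Reports the current state once; when follow is true, keeps polling until the run
     * reaches a terminal state or maxPolls is exhausted. Returns the last state seen.
     */
    static LifeCycleState report(Supplier<LifeCycleState> statusSource,
                                 boolean follow, long pollMillis, int maxPolls) {
        LifeCycleState state = statusSource.get();
        System.out.println("Job run state: " + state);
        int polls = 1;
        while (follow && !isTerminal(state) && polls < maxPolls) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop following.
                Thread.currentThread().interrupt();
                return state;
            }
            state = statusSource.get();
            polls++;
            System.out.println("Job run state: " + state);
        }
        return state;
    }
}
```

In a real mojo the supplier would wrap a call that looks up the latest run of the named job, and the follow flag, poll interval, and poll limit would presumably be exposed as plugin properties.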

Environment property should not be persisted in the job-template.json

BUG

Overview

Environment is not a property that should be part of a build that is persisted in this way. It should always be passed in via a property to the build.

Result is that the output of job-template.json no longer needs to contain the "environment" property.
Jobs with freemarker references to environment should still work as expected.
