Giter Club home page Giter Club logo

spark-scala-template's Introduction

Spark Scala Template

Build Status

codecov

Assortment of default settings, best practices, and general goodies for Scala projects.

Layers

There are 3 layers:

  • The reusable shared DAG steps, implemented via (SQL like) Dataset API
  • The user defined functions used by the steps
  • A command with cli, that builds a whole DAG by combining steps

The cli

-d, --debug                  Explains plan during DAG construction
-i, --input  <arg>           Path to the raw data to process (local, hdfs, s3)
-l, --limit  <arg>           Limit DAG steps to given number, the read and
                             write/show steps are always added
    --lines-to-show  <arg>   Amount of lines to shows to the console (instead
                             of writing snappy compressed parquet files)
-n, --nodes  <arg>           Spark nodes (local run) (default = local[*])
-o, --output  <arg>          Output path (local, hdfs, s3)
-s, --stay                   Wait for key press to exit (to keep SparkSession
                             and webserver running while debugging)
    --help                   Show help message

Testing

There are several ways to test the code:

  • With local in memory cluster and included data, e.g.:
    • A whole DAG ./sbt -i data/*.gz -o test
    • A reduced DAG ./sbt -i data/*.gz -o test --limit 1 --lines 2
  • Sherlock Holmes mode: Start the scala console via ./sbt console, there is a preconfigured SparkSession named spark and all the Steps are imported.
  • Execute the automated tests via: ./sbt test and ./sbt it:test, or with coverage report: ./sbt coverage test coverageReport. The report can be found in target/scala-2.11/scoverage-report.
  • Submitting a Spark job with any desired cli argument on a running cluster or import the code into Apache Zeppelin and use it there.

Releasing

The versioning is done via Annotated Git Tags, see git-describe and Git-Basics-Tagging for annotated tags.

Tags have to be pushed via git push --tags.

This requires a full repo clone on the continuous integration machine (no shallow clone).

The benefit of git describe based versioning is, that every commit has an distinct automatic version. This facilitates also continuous integration and delivery.

sbt bash wrapper

The sbt bash wrapper from https://github.com/paulp/sbt-extras is included as ./sbt to reduce the build dependencies to OpenJDK 8 and bash.

Scala REPL

The console and consoleQuick sbt tasks start the Scala REPL with a few commonly used imports such as JavaConverters, collection.mutable, Future, Random, Try and NonFatal already in scope.

The REPL also defines a desugar macro, that expands an expression to its desugared form and inferred type. Try, for instance desugar(1 to 3).

Scalastyle

The scalastyle and test:scalastyle sbt tasks are used to check source and test code with Scalastyle. The error list is saved to Checkstyle-compatible files target/scalastyle-result.xml and target/scalastyle-test-result.xml respectively.

Scalastyle runs automatically against source and test code with the sbt test task.

It is not recommended to make the compile task dependent on Scalastyle. Since Scalastyle runs first and fails if the code does not compile, one would not get the Scala compiler error messages.

scoverage

To execute tests with code coverage enabled run the following sbt tasks: clean coverage test coverageReport. Coverage reports are saved to target/scala-2.11/scoverage-report/.

The coverage command is sticky. To turn it off, use coverageOff.

Scalariform

To format source and test code run the scalariformFormat and test:scalariformFormat sbt tasks.

Scalariform is provided as a convenience and starting point; it is not sufficient to be fully compliant with the Scala Style Guide.

sbt-updates

There is a dependencyUpdates sbt task to check Maven repositories for dependency updates and a pluginUpdates task to check for sbt plugin updates.

sbt-license-report

To generate a report with the dependency licenses used by the project, use the dumpLicenseReport sbt task. The report is saved to the target/license-reports directory.

dependency tree

The task coursierDependencyTree displays the project dependencies as a tree and coursierDependencyInverseTree shows everything that depends on an artifact.

sbt-stats

The stats sbt task shows basic line counting statistics.

sbteclipse

The eclipse sbt task to create Eclipse project definitions.

ensime-sbt

The ensimeConfig task to generate .ensime config files.

sbt-api-mappings

Fills apiMappings for common Scala libraries during doc task.

License

Copyright 2011-2016 Marconi Lanna
Copyright 2017 Daniel Bast

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

spark-scala-template's People

Contributors

marconilanna avatar dbast avatar

Watchers

 avatar James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.