leobenkel / zparkio Goto Github PK

View Code? Open in Web Editor NEW

172.0 8.0 31.0 779 KB

Boiler plate framework to use Spark and ZIO together.

Home Page: https://github.com/leobenkel/ZparkIO

License: MIT License

Scala 92.12% Makefile 2.73% Shell 5.15%

spark zio scala functional-programming boiler-plate helpers template

zparkio's Introduction

ZparkIO

Boiler plate framework to use Spark and ZIO together.

The goal of this framework is to blend Spark and ZIO in an easy to use system for data engineers.

Allowing them to use Spark in a new, faster, more reliable way, leveraging ZIO power.

What is this library for ?
More About ZparkIO
- Public Presentation
- Migrate your Spark Project to ZparkIO
Why would you want to use ZIO and Spark together?
How to use?
- Include dependencies
  - Unit-test
- How to use in your code?
  - Main
  - Spark
  - Command lines
  - Helpers
  - Unit test
Examples
- Simple example
- More complex architecture
Authors
- Leo Benkel

Created by gh-md-toc

What is this library for ?

This library will implement all the boiler plate for you to be able to include Spark and ZIO in your ML project.

It can be tricky to use ZIO to save an instance of Spark to reuse in your code and this library solve all the boilerplate problem for you.

More About ZparkIO

Public Presentation

Feel free to look at the slides on Google Drive or on SlideShare presented during the ScalaSF meetup on Thursday, March 26, 2020. You can also watch the presentation on Youtube.

ZparkIO was on version 0.7.0, so things might be out of date.

Migrate your Spark Project to ZparkIO

Migrate from Plain Spark to ZparkIO

Why would you want to use ZIO and Spark together?

From my experience, using ZIO/Future in combination with Spark can speed up drastically the performance of your job. The reason being that sources (BigQuery, Postgresql, S3 files, etc...) can be fetch in parallel while the computation are not on hold. Obviously ZIO is much better than Future but it is harder to set up. Not anymore!

Some other nice aspect of ZIO is the error/exception handling as well as the build-in retry helpers. Which make retrying failed task a breath within Spark.

How to use?

I hope that you are now convinced that ZIO and Spark are a perfect match. Let's see how to use this Zparkio.

One of the easiest way to use ZparkIO is to use the giter8 template project:

sbt new leobenkel/zparkio.g8

Include dependencies

First include the library in your project:

libraryDependencies += "com.leobenkel" %% "zparkio" % "[SPARK_VERSION]_[VERSION]"

With version being: .

To checkout out the Spark Versions and the Version.

This library depends on Spark, ZIO and Scallop.

Unit-test

You can also add

libraryDependencies += "com.leobenkel" %% "zparkio-test" % "[VERSION]"

With version being: .

To get access to helper function to help you write unit tests.

How to use in your code?

There is a project example you can look at. But here are the details.

Main

The first thing you have to do is extends the ZparkioApp trait. For an example you can look at the ProjectExample: Application.

Spark

By using this architecture, you will have access to SparkSesion anywhere in your ZIO code, via

import com.leobenkel.zparkio.Services._

for {
  spark <- SparkModule()
} yield {
  ???
}

for instance you can see its use here.

Command lines

You will also have access to all your command lines automatically parsed, generated and accessible to you via:

CommandLineArguments ; it is recommended to make this helper function to make the rest of your code easier to use.

Then using it, like here, is easy.

Helpers

In the implicits object, that you can include everywhere. You are getting specific helper functions to help streamline your projects.

Unit test

Using this architecture will literally allow you to run your main as a unit test.

Examples

Simple example

Take a look at the simple project example to see example of working code using this library: SimpleProject.

More complex architecture

A full-fledged, production-ready project will obviously need more code than the simple example. For this purpose, and upon suggestion of several awesome people, I added a more complex project. This is a WIP and more will be added as I go. MoreComplexProject.

Authors

Leo Benkel

Alternatives

univalence/zio-spark

zparkio's People

Contributors

Stargazers

Watchers

zparkio's Issues

Make a Giter8 template

That would be nice to have a Giter8 template to get started with a Zparkio project.
http://www.foundweekends.org/giter8/

Any chance you could target Scala 2.13?

Currently only targets 2.11 and 2.12

Thanks

More examples

Write better example to illustrate good project structure.

user.scala - for each data type

case class User () {}

object User {
   def fetchFromSource: ZDS_R[Source, User] = ???
}

and transformations:

object Transformations {
    def transform(input: Dataset[A]): ZDS[B] = ???
}

and sources:

trait Sources {
    def getUser
}
object Sources {
  object Live extends Sources {
     def getUser = User.fetchFromSources
  }
}

in Application:

trait Application {
    def getSources: Sources
}

in main.scala:

object Main extends Application {
   override final lazy val getSources = Source.Live
}

in tests:

object TestApp extends Application {
      override final lazy val getSources = FakeDataSources
}

Logger only prints to the console. not possible to integrate with actual logging framework

Currently loggerFactory is fixed to depend on Console.Service. It would be nice that we can integrate with logging framework

https://github.com/zio/zio-logging

CLI - Scallop

Put Scallop specific code in its own component and make the default version more generic

Add support for Spark 3

The latest Databricks runtime supports Spark 3.0.1. To enable adoption across the growing Spark community on Azure, Zparkio should support the third major release of Spark.

Broken links in the slide #28

In the README,

Feel free to look at the slides on [Google Drive](https://docs.google.com/presentation/d/1gyFJpH2mzJ9ghSTsIMrUHWA9rCtSn2ML9ERUFvuYSp8)

In the side #28, there are 2 links -

https://github.com/leobenkel/ZparkIO/tree/master/ProjectExample/src/main/scala/com/leobenkel/zparkioProjectExample
https://github.com/leobenkel/ZparkIO/tree/master/ProjectExample_MoreComplex/src/main/scala/com/leobenkel/zparkioProfileExampleMoreComplex

Both 404 "Page not found".

Add more helper functions

As suggested https://www.reddit.com/r/scala/comments/e67f18/zio_spark_zparkio/f9pjgge?utm_source=share&utm_medium=web2x , https://github.com/univalence/spark-tools/tree/master/spark-zio should be imported to take advantage of the built shortcuts

ZDataset

Make a wrapper around the Spark dataset that could control caching as well as offer both set of features (dataset and rdd) under one class

Project Example is returning Github 404 and support for spark 3.0

Hello, in the Readme when I click in Project Exampl/Project Example Application, the return is a 404 Github Page:

Lastly, do you have plans to support spark 3.0 ?

Pointing to univalence/zio-spark

Hello @leobenkel,

Could you drop a link to our wrapper for spark somewhere in your Readme?
https://github.com/univalence/zio-spark

We tried for the past few years to see if we could leverage the way you manage configuration and all. However, we are not finding something that fits.

We listed your project a long time ago as an alternative framework, if you would, could do the same, listing us as an alternative library, that would be great.

Cheers,
Jonathan

Shortcut for map and flatmap

Right now a type ZIO[..., Dataset[A]] will need a double map to change to ZIO[..., Dataset[B]]

.map(_.map(...))

That would be nice to have a map shortcut that convert directly from ZDS[A] to ZDS[B].

[TODO] Convert to zio-test

convert test from scalatest to zio-test

fix coverage

rewards-network/scanamo@e2cbdb4#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5

Remove all need for `import spark.implicits._` when using case classes

Build for all Spark version

It would be great to publish a version for each Spark version ( 2.3.x , 2.4.x , etc.. ) instead of having a hardcoded version like now.

migration tools

Make more migration tools to move easily from a none-zio / future code base into a ZparkIO one.

leverage Iskra syntax

https://github.com/VirtusLab/iskra

NoClassDefFoundError when I try to run example 1.

I tried to clone your project and run example 1 main. I got:

java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
at com.leobenkel.zparkio.Services.SparkModule$Factory$class.com$leobenkel$zparkio$Services$SparkModule$Factory$$sparkBuilder(SparkModule.scala:21)

Please let me know if I need to setup something? Or it's a bug?

When I run the application I found delta-core is not working with 3.1.x spark. And if I upgrade to 3.2.x and run again I found the following issue.

[error] stack trace is suppressed; run 'last update' for the full output
[error] stack trace is suppressed; run 'last ssExtractDependencies' for the full output
[error] (update) lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:
[error] https://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.10.2/avro-mapred-1.10.2-hadoop2.jar: not found: https://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.10.2/avro-mapred-1.10.2-hadoop2.jar
[error] (ssExtractDependencies) lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:
[error] https://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.10.2**/avro-mapred-1.10.2-hadoop2.ja**r: not found: https://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.10.2/**avro-mapred-1.10.2-hadoop2.jar**
[error] Total time: 4 s, completed Dec 10, 2022, 9:37:14 AM

Support / Package for Spark 3.1.2

Hey there 👋 and thanks for this nice library 👍

Do you plan to create a version for spark 3.1.2? Would be happy to have it :-)

Kind regards