
Comments (14)

aaronsteers commented on July 22, 2024

@dmateusp - I loved your idea of hosting a Dockerfile in dbt-spark and using this for containerized testing. I created the work-in-progress PR #55 which adds a Dockerfile based upon my past work and exploration on this topic.

@drewbanin - I'm very interested in your thoughts on this. Would you be interested in merging a Dockerfile and perhaps including that docker image (once complete) as a part of the repo?

from dbt-spark.

drewbanin commented on July 22, 2024

Hey @aaronsteers - really cool idea! This definitely is not currently supported, but I can imagine adding a new method to the Spark target config, local, which will attach to a locally running Spark context.

In your experience, are there major differences between the SparkSQL that runs over HTTP/Thrift and SQL passed to spark.sql directly? It shouldn't be a problem if the SQL is identical.


aaronsteers commented on July 22, 2024

Hey, @drewbanin. As far as I'm aware, the SQL should be identical whether connecting via spark.sql() or via thrift - but it would probably require some testing, tbh. The only difference I could see is if the hive/thrift adapter were sending to a slightly different interpreter.

Thanks!


dmateusp commented on July 22, 2024

Hi there! I agree it would be great to have a way to provide a local Spark + dbt environment. I've been looking into this issue myself and wanted to discuss some implementation details/questions I have.

First, I think setting up a thrift server locally is quite hard, and I'm not sure there are other options to connect through a JDBC-like interface for Spark.

So I went down the route that @aaronsteers was suggesting, by looking at one of the possible shell interfaces to Spark (scala/python/java/R/spark-sql).

Ideally I would love to support all of them, possibly by sending commands to a subprocess, wrapping the SQL in the appropriate programming interface, and writing the results to a temporary CSV file to fetch them back. This would not be performant, but it would have the advantage of being flexible and not introducing extra dependencies (the local environment is meant as a sandbox anyway!)

I wanted to try connecting to an existing REPL and dynamically sending the code there from the dbt process, but I haven't figured out how to do that yet! (The code sent would change depending on the "method" set in the profile, an enum of "spark", "java", "R", etc.) Some IDEs offer that functionality, and I like that we would be able to see the SQL queries appear in the terminal.

The other way I could go about it would be to launch the Spark shell as a subprocess, but then the user would have to configure the launch command, and they wouldn't be able to see the queries being sent to the REPL.
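For illustration, the wrapping step described above might look something like this (`wrap_sql` is a hypothetical helper, not part of dbt-spark; the keys mirror the proposed "method" enum):

```python
def wrap_sql(sql: str, method: str) -> str:
    """Wrap a dbt-compiled SQL statement in the snippet a given
    Spark REPL would accept (illustrative sketch only)."""
    templates = {
        "python": 'spark.sql("""{0}""").show()',   # pyspark shell
        "scala": 'spark.sql("""{0}""").show()',    # spark-shell
        "r": 'showDF(sql("{0}"))',                 # SparkR
        "spark-sql": "{0};",                       # spark-sql CLI takes raw SQL
    }
    if method not in templates:
        raise ValueError(f"unsupported method: {method}")
    return templates[method].format(sql)
```

The per-method templates are the whole trick: the SQL itself stays identical, only the wrapper around it changes.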

I'd appreciate any feedback on the approach so far; hopefully I'll have code to share soon (so far I've done some minor refactoring to put a structure in place for additional methods of running Spark).


aaronsteers commented on July 22, 2024

@dmateusp - I have gotten this working successfully in a Docker container, and these two options both work:

  1. Run a docker container locally that hosts spark and thrift, then you can run DBT locally using the container's thrift port.
  2. Run dbt-spark from within a customized Spark container. The container launches Spark and then Thrift, and then runs some dbt tasks connecting to its own Thrift endpoint.
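A hypothetical docker-compose sketch of option 1 (the image name and Spark install path are placeholders; any image with Spark built with Hive support should work):

```yaml
version: "3"
services:
  spark-thrift:
    image: my-spark-hive-image       # placeholder: Spark built with Hive support
    environment:
      SPARK_NO_DAEMONIZE: "1"        # keep start-thriftserver.sh in the foreground
    command: /opt/spark/sbin/start-thriftserver.sh --master local[2]
    ports:
      - "10000:10000"                # Thrift JDBC endpoint dbt connects to
```

With something like this running, dbt on the host can point its thrift profile at localhost:10000.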

I have not yet found a way around the Thrift requirement, but if you already have a spark context, the code sample here might be helpful: https://github.com/slalom-ggp/dataops-tools/blob/1e36e3d09b99211e4223e436f2da825c117a92e8/slalom/dataops/sparkutils.py#L349-L352

In order to skip the Thrift requirement, I think one would have to replace references to pyhive with pyspark and then connect to the cluster's Spark endpoint (7077) instead of the Thrift JDBC endpoint (10000). From a pyspark SparkSession object, I think we could just run spark.sql("SELECT * FROM FOOBAR") with basically the same result as is achieved with Thrift. (Honestly, I'm not sure whether the behavior would differ at all, but we could definitely send SQL statements directly in that manner without using Thrift.)
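A minimal sketch of that swap, assuming a standalone cluster at spark://localhost:7077 (`connect_local` and `run_statement` are hypothetical names, not dbt-spark APIs):

```python
def connect_local(master="spark://localhost:7077"):
    # Hypothetical helper: build a SparkSession against a local standalone
    # master (7077) rather than the Thrift JDBC endpoint (10000).
    # Import is deferred so the module loads even without pyspark installed.
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .master(master)
        .appName("dbt-spark-local")
        .enableHiveSupport()
        .getOrCreate()
    )

def run_statement(session, sql):
    # A SparkSession exposes .sql(), so compiled statements can be sent
    # directly instead of going through a pyhive cursor.
    return session.sql(sql)
```

The open question from the thread still applies: whether spark.sql() and the Thrift server accept exactly the same dialect would need testing.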


dmateusp commented on July 22, 2024

hey @aaronsteers thanks for sharing your approach,

I think we could benefit from having the README updated with some instructions for running it locally (using the Dockerized Thrift container). I did not find a Docker container that worked out of the box with Spark and Thrift - could you share that image? (Or maybe we could host it in the dbt-spark repo for future integration testing?)

The pyspark approach could be worth exploring!

I have a draft PR on my fork to show what I've been playing with: I use pexpect to wrap the SQL produced by dbt in spark.sql() calls sent to a shell session. I think I've explored it enough to say that it has many problems (getting exception details, transmitting data back) and that using pyspark would be better. Also, pexpect has compatibility issues with Windows. (https://github.com/dmateusp/dbt-spark/pull/1/files)

If the Docker + Thrift approach is good enough in your opinion for playing with dbt-spark locally, should we consider documenting that approach instead? Because, as you said, pyspark should not behave differently, but it's still additional code that needs to be supported, tested and documented.


jtcohen6 commented on July 22, 2024

@aaronsteers Now that we've merged #58, are you okay with closing this issue?


aaronsteers commented on July 22, 2024

@jtcohen6 - The updates in #58 should do the trick in theory. That said, if it's okay with you, I'd still like to keep this open a little longer to test usability and documentation around this use case. I can try to get to it this week so we're not keeping this outstanding too long.


jtcohen6 commented on July 22, 2024

Sure! No rush on my end


chinwobble commented on July 22, 2024

@aaronsteers
This PR looks really good:
https://github.com/dmateusp/dbt-spark/pull/1/files

Is there any way we can get this merged? I am looking for similar functionality so I can register UDFs in a sane way.


Data-drone commented on July 22, 2024

What is the status on this now?


ninomllr commented on July 22, 2024

I'd really like to hear what the current state of this is. Basically, how do I start Thrift?


ninomllr commented on July 22, 2024

I basically tried to start a Thrift server from my Jupyter notebook like this: https://stackoverflow.com/a/54223260, then added the configuration below to my profiles.yml, without any luck.

I get the following error:

Could not connect to any of [('127.0.0.1', 443), ('::1', 443, 0, 0)]
07:10:29  Encountered an error:
Runtime Error
  Runtime Error
    Database Error
      failed to connect

profiles.yml

default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      schema: delta

Can someone tell me how to start a local Thrift server?
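For what it's worth, the traceback shows dbt attempting port 443, which suggests it is falling back to a default because the profile has no port entry. Assuming the Thrift server is on its standard port (Spark distributions ship sbin/start-thriftserver.sh, which listens on 10000 by default), setting it explicitly may help:

```yaml
default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      port: 10000   # Thrift server default; without this, dbt tried 443
      schema: delta
```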


github-actions commented on July 22, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

