Comments (14)
@dmateusp - I loved your idea of hosting a Dockerfile in dbt-spark and using it for containerized testing. I created the work-in-progress PR #55, which adds a Dockerfile based on my past work and exploration of this topic.
@drewbanin - I'm very interested in your thoughts on this. Would you be interested in merging a Dockerfile, and perhaps including that Docker image (once complete) as part of the repo?
Hey @aaronsteers - really cool idea! This definitely is not currently supported, but I can imagine adding a new method to the Spark target config, `local`, which would attach to a locally running Spark context.
In your experience, are there any major differences between the SparkSQL that would run over http/thrift vs. calling `spark.sql` directly? It shouldn't be a problem if the SQL is identical.
Hey, @drewbanin. As far as I'm aware, the SQL should be identical whether connecting via spark.sql() or via thrift - but it would probably require some testing, tbh. The only difference I could see is if the hive/thrift adapter were sending to a slightly different interpreter.
Thanks!
Hi there! I agree that it would be great if we had a way to provide a local Spark dbt environment. I've been looking into this issue myself, and I wanted to discuss some implementation details/questions I have.
First, I think setting up a Thrift server locally is quite hard, and I'm not sure there are other options for connecting to Spark through a JDBC-like interface.
So I went down the route that @aaronsteers was suggesting, by looking at one of the possible shell interfaces to Spark (scala/python/java/R/spark-sql).
Ideally I would love to support all of them, possibly by sending commands to a subprocess, wrapping the SQL in the appropriate programming interface, and writing the results to a temporary CSV file to fetch them back. This would not be performant, but it would have the advantage of being flexible and not introducing extra dependencies (the local environment's objective would be a "sandbox" anyway!).
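A minimal sketch of that subprocess idea (all names here are hypothetical, not from the thread; it assumes `spark-submit` is on the PATH and simply wraps one dbt-generated statement in `spark.sql()`, dumping the result to CSV):

```python
# Hypothetical sketch: run one dbt-generated SQL statement through a
# throwaway PySpark driver script and fetch the result via a temp CSV.
import pathlib
import subprocess
import tempfile
import textwrap

def run_sql_via_subprocess(sql: str) -> pathlib.Path:
    """Wrap `sql` in spark.sql() and execute it with spark-submit."""
    workdir = pathlib.Path(tempfile.mkdtemp(prefix="dbt_spark_"))
    out_dir = workdir / "result"
    driver = textwrap.dedent(f"""\
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()
        # The dbt-generated SQL, wrapped in the Python interface:
        spark.sql({sql!r}).coalesce(1).write.csv({str(out_dir)!r}, header=True)
    """)
    script = workdir / "driver.py"
    script.write_text(driver)
    subprocess.run(["spark-submit", str(script)], check=True)
    return out_dir  # caller reads the part-*.csv file(s) from here

# Usage: run_sql_via_subprocess("SELECT 1 AS id")
```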
I wanted to try connecting to an existing REPL and dynamically sending the code there from the dbt process, but I haven't figured out how to do that yet! (The code sent would change depending on the "method" set in the profile - an enum of "spark", "java", "R", etc.) Some IDEs offer that functionality, and I like that we would be able to see the SQL queries appear in the terminal.
The other way I could go about it would be to launch the Spark shell as a subprocess, but then the user would have to configure the command used to launch it, and they wouldn't be able to see the queries being sent to the REPL.
I'd appreciate any feedback on the approach so far - hopefully I'll have code to share soon. (So far I've done some minor refactoring to put a structure in place for additional methods of running Spark.)
@dmateusp - I have gotten this working successfully in a Docker container. These two options both work:
- Run a Docker container locally that hosts Spark and Thrift; you can then run dbt locally against the container's Thrift port.
- Run dbt-spark from within a customized Spark container. The container launches Spark, then Thrift, and then runs some dbt tasks connecting to its own Thrift endpoint.
I have not yet found a way around the Thrift requirement, but if you already have a spark context, the code sample here might be helpful: https://github.com/slalom-ggp/dataops-tools/blob/1e36e3d09b99211e4223e436f2da825c117a92e8/slalom/dataops/sparkutils.py#L349-L352
In order to skip the Thrift requirement, I think one would have to replace references to pyhive with pyspark and then connect to the cluster's Spark endpoint (7077) instead of the Thrift JDBC endpoint (10000). From a pyspark `SparkSession` object, I think we could just run `spark.sql("SELECT * FROM FOOBAR")` with basically the same result as is achieved with Thrift. (I'm honestly not sure if the behavior would differ at all, but we could definitely send SQL statements directly in that manner without using Thrift.)
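A minimal sketch of that pyspark route (the master URL and table name are placeholders; this assumes a standalone Spark master listening on 7077):

```python
# Hypothetical sketch: talk to Spark directly via a SparkSession
# instead of going through the Thrift JDBC endpoint on port 10000.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")  # cluster endpoint, not Thrift
    .appName("dbt-spark-local")
    .enableHiveSupport()               # so existing metastore tables resolve
    .getOrCreate()
)

# With a live session, dbt-generated SQL could be submitted directly:
df = spark.sql("SELECT * FROM FOOBAR")  # FOOBAR is a placeholder table
df.show()
```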
Hey @aaronsteers, thanks for sharing your approach!
I think we could benefit from having the README updated with some instructions for running it locally (using the Dockerized Thrift container). I did not find a Docker container that worked out of the box with Spark and Thrift - could you share that image? (Or maybe we could host it in the dbt-spark repo for future integration testing?)
The pyspark approach could be worth exploring!
I have a draft PR on my fork to show what I've been playing with: I use `pexpect` to wrap SQL produced by dbt into `spark.sql()` calls sent to a shell session. I think I've explored it enough to say that it has many problems with getting exception details and transmitting data, and that using pyspark would be better. Also, `pexpect` has compatibility issues with Windows. (https://github.com/dmateusp/dbt-spark/pull/1/files)
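For illustration, the pexpect approach boils down to something like this (a hypothetical sketch, not the PR's actual code; it assumes `pyspark` is on the PATH):

```python
# Hypothetical sketch: drive an interactive pyspark REPL with pexpect
# and paste dbt-generated SQL into it, wrapped in spark.sql().
import pexpect

repl = pexpect.spawn("pyspark", encoding="utf-8", timeout=120)
repl.expect(">>> ")                    # wait for the Python prompt
sql = "SELECT 1 AS id"                 # stand-in for dbt-generated SQL
repl.sendline(f"spark.sql({sql!r}).show()")
repl.expect(">>> ")                    # output (and any traceback) land here
print(repl.before)                     # raw REPL text; parsing this for
                                       # results/exceptions is the pain
                                       # point mentioned above
repl.close()
```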
If the Docker + Thrift approach is good enough in your opinion to play with dbt-spark locally, should we consider documenting that approach instead? Because, as you said, pyspark should not behave differently, but it's still additional code that needs to be supported, tested, and documented.
@aaronsteers Now that we've merged #58, are you okay with closing this issue?
@jtcohen6 - The updates in #58 should do the trick in theory. That said, if it's okay with you, I'd still like to keep this open a little longer to test usability and documentation around this use case. I can try to get to it this week so we're not keeping this outstanding too long.
Sure! No rush on my end
@aaronsteers, this PR looks really good: https://github.com/dmateusp/dbt-spark/pull/1/files
Is there any way we can get this merged? I'm looking for similar functionality so I can register UDFs in a sane way.
What is the status on this now?
Would really like to hear what the current state is on this. Basically, how do I start Thrift?
I basically tried to start a Thrift server from my Jupyter notebook like this: https://stackoverflow.com/a/54223260 and then added the following to my profiles.yml, without any luck.
I get the following error:

```
Could not connect to any of [('127.0.0.1', 443), ('::1', 443, 0, 0)]
07:10:29 Encountered an error:
Runtime Error
  Runtime Error
    Database Error
      failed to connect
```
profiles.yml:

```yaml
default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      schema: delta
```
Can someone tell me how to start a local Thrift server?
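For what it's worth, one way to do this from inside a running PySpark session is via Spark's internal `HiveThriftServer2` class (a hedged sketch - this is internal, version-dependent API, and the exact call is an assumption rather than verified from the linked answer):

```python
# Hypothetical sketch: expose a running PySpark session over Thrift by
# starting HiveThriftServer2 in the driver JVM. Internal API -- version
# dependent and unsupported, shown for illustration only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-thrift")
    .config("hive.server2.thrift.port", "10000")  # port dbt should target
    .enableHiveSupport()
    .getOrCreate()
)

# py4j bridge into the JVM hosting the Spark driver:
jvm = spark.sparkContext._gateway.jvm
jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(
    spark._jwrapped  # the underlying Java SQLContext
)
```

Note that the error above shows dbt trying port 443, so the profiles.yml would likely also need an explicit `port:` entry matching whatever port the Thrift server actually listens on.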
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.