Comments (7)
Great find! This looks promising and I imagine it could create significant performance benefits. And thank you for the link to the source code. I traced the blame on this line and it looks like the latest commit was >3 years ago, with even the prior version of that line shown here appearing to support the same syntax.
With that said, my guess is that support for this correlates with Spark version more so than with vendor, and it appears this has been in since at least Spark 2.2 and likely longer. (Someone else jump in if you have additional/different info.)
For my part, I think this is a relatively safe bet and likely worth the performance boost. Given the noted lack of documentation, though, I also think some type of safe failover or feature flag might be advisable.
from dbt-spark.
Thanks for the pointer to that, @aaronsteers!
I have updated my original issue comment to reflect the issue on my end (faulty/outdated JDBC driver) that was causing me to encounter errors with variants of `describe`. That said, `show table extended in my_db like '*'` is still the closest thing we have to an information schema that we can access all at once.
If it works across the board, I think it offers a more performant approach to the `get_catalog` updates in #39 and #41, versus running `describe table extended` for every relation in the project. The difficulty there is in parsing the `information` column, which is a big string delimited by `\n`, rather than additional rows per property as in `describe table extended`.
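To make the parsing difficulty concrete, here is a minimal sketch of splitting one `information` string into properties. The sample string is a made-up approximation of what Spark returns, not captured output, and `parse_information` is a hypothetical helper name, not part of dbt-spark.

```python
def parse_information(information: str) -> dict:
    """Split a newline-delimited `information` string into key/value properties.

    Assumption: each property line looks like "Key: value". Schema tree lines
    (those starting with "|--") are skipped here and left for separate handling.
    """
    properties = {}
    for line in information.split("\n"):
        # Split on the first ": " only, so values containing colons stay intact.
        key, sep, value = line.partition(": ")
        if sep and not line.lstrip().startswith("|--"):
            properties[key.strip()] = value.strip()
    return properties


# Hypothetical sample, shaped like one row of the `information` column.
sample_information = (
    "Database: my_db\n"
    "Table: my_table\n"
    "Type: MANAGED\n"
    "Provider: parquet\n"
    "Schema: root\n"
    " |-- id: integer (nullable = true)\n"
)

properties = parse_information(sample_information)
```

The exact set of keys may differ by Spark version, so treating the result as a loose dictionary rather than a fixed record seems safer.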
from dbt-spark.
Related: This feature is a long way out (at best), but here's the last and best reference I could find to the feature request to add `INFORMATION_SCHEMA` support natively. You can vote and review that Spark Jira issue here: https://issues.apache.org/jira/browse/SPARK-16452 . We cannot block on this feature but I thought it would be helpful to flag it at least here for general awareness.
from dbt-spark.
> The difficulty there is in parsing the `information` column, which is a big string delimited by `\n`, rather than additional rows per property as in `describe table extended`.
I think a regex approach could probably make quick work of the column names and data types. I will take a quick stab at that and post back here.
from dbt-spark.
This regex search string seems to work on the sample output from above:
\|-- (.*): (.*) \(nullable = (.*)\b
The regex captures three groups per column:
- column name
- column type
- nullable (true/false)
Link to test results and demo of this regex: https://regex101.com/r/E5YHCs/1
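As a rough illustration, the regex above can be applied in Python as follows. The sample string is hypothetical and only assumes that the schema portion of `information` uses Spark's `|-- name: type (nullable = ...)` tree layout. `parse_columns` is an illustrative name, not an existing dbt-spark function.

```python
import re
from typing import List, Tuple

# The regex from the comment above, unchanged.
SCHEMA_LINE = re.compile(r"\|-- (.*): (.*) \(nullable = (.*)\b")


def parse_columns(information: str) -> List[Tuple[str, str, bool]]:
    """Return (column name, column type, nullable) for each schema line found."""
    return [
        (name, dtype, nullable == "true")
        for name, dtype, nullable in SCHEMA_LINE.findall(information)
    ]


# Hypothetical sample, not captured from a real cluster.
sample_information = (
    "Database: my_db\n"
    "Table: my_table\n"
    "Schema: root\n"
    " |-- id: integer (nullable = false)\n"
    " |-- amount: decimal(10,2) (nullable = true)\n"
)

columns = parse_columns(sample_information)
```

Because `.` does not match newlines by default, each schema line is matched on its own. Nested struct fields also begin with `|--`, so they would be returned as additional entries unless filtered out.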
from dbt-spark.
Very nice!
As far as the "more performant approach to the `get_catalog` updates" that I mentioned above, I think a natural fit here is the `_get_one_catalog` method that is introduced in dbt-labs/dbt-core#2037 and will hopefully ship in the 0.16.0 release of dbt Core.
Between now and then, we can still try to ship the proposed enhancements to `get_catalog` as written (looping across all relations and running `describe ... extended`) in the 0.15.0 release of this plugin.
from dbt-spark.
I'm going to close this and open a more specific issue that suggests reimplementing `_get_one_catalog` to operate more efficiently on one schema at a time, instead of one relation at a time.
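A small sketch of why the per-schema approach should issue fewer statements. This only builds the SQL strings discussed in this thread and does not use any dbt API. Both function names and the relation list are hypothetical.

```python
from typing import List, Tuple


def per_relation_statements(relations: List[Tuple[str, str]]) -> List[str]:
    """One statement per relation, as in the looping approach."""
    return [
        f"describe table extended {schema}.{table}"
        for schema, table in relations
    ]


def per_schema_statements(relations: List[Tuple[str, str]]) -> List[str]:
    """One statement per distinct schema, covering all of its relations."""
    schemas = sorted({schema for schema, _ in relations})
    return [f"show table extended in {schema} like '*'" for schema in schemas]


# Hypothetical project: three relations spread over two schemas.
relations = [
    ("analytics", "orders"),
    ("analytics", "customers"),
    ("staging", "raw_orders"),
]

relation_queries = per_relation_statements(relations)
schema_queries = per_schema_statements(relations)
```

Here the per-relation loop produces three statements and the per-schema version two; the gap grows with the number of relations per schema.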
from dbt-spark.