Comments (17)

pgr0ss commented on July 22, 2024

AWS support helped me figure out the issue:

On my EMR cluster, port 10000 is for Hive and 10001 is for Spark. When I changed the port to 10001, it worked (after running start-thriftserver.sh).

@rhousewright Should we maybe mention this port difference in the docs as part of your PR? https://github.com/fishtown-analytics/dbt-spark/pull/20/files#diff-04c6e90faac2675aa89e2176d2eec7d8R22

Here's my profile now:

default:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: experiments
      host: 127.0.0.1
      port: 10001
      threads: 4
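
As a sanity check, connecting to each port with beeline makes the split obvious. Untested sketch, using the EMR paths and ports from above:

# HiveServer2 (HiveQL):
/usr/lib/spark/bin/beeline -u 'jdbc:hive2://localhost:10000/default'

# Spark thrift server (Spark SQL), once start-thriftserver.sh is running:
/usr/lib/spark/bin/beeline -u 'jdbc:hive2://localhost:10001/default'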

rhousewright commented on July 22, 2024

So this is super interesting. I tried running a similar thing against a cluster running EMR 5.21.0, with the following generated SQL for my model, and it worked just fine for me. So that's weird?

create table dbt_test_db.my_first_dbt_model
    using parquet
    partitioned by (id)
    as
select 1 as id, 2 as not_id

There's nothing in the 5.21.0 release notes that would indicate any relevant changes (vs. 5.20.0), and I'm not doing anything unusual or relevant in terms of cluster config (I am using the Glue catalog, in case that matters). The only thing I did differently than you, I think, is to start the thrift server with sudo /usr/lib/spark/sbin/start-thriftserver.sh (without the --master yarn-client).

I will note that I only have Spark installed on the cluster (I don't have Hive installed) - do you have both installed? If so, is it possible that installing Hive in some way overtakes the HiveServer2 connection to the Spark backend? I haven't had the chance to test that theory yet, though. Config I'm using right now is:
[screenshot of connection config, 2019-06-13]

In general, I'm hoping to get some dedicated time to work on dbt-spark stuff soon - trying to set aside time in an upcoming sprint to see if we can get a POC working in our space. Hopefully we'll learn a lot, and possibly generate some pull requests, through that process!

pgr0ss commented on July 22, 2024

I see the same thing as you (#20 (comment)) when I run via spark-submit. I created a file test.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").enableHiveSupport().getOrCreate()
r = spark.sql("show tables in experiments")
print(r)

Running it:

$ spark-submit test.py
DataFrame[database: string, tableName: string, isTemporary: boolean]

This is why I thought it might be related to the thrift connection somehow. But it could also be something in this library that isn't thrift-specific.
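
One way to take dbt out of the picture would be to run the same query over thrift directly with PyHive (which I believe the thrift branch uses under the hood). Untested sketch, with the host and port from my profile above:

from pyhive import hive

# untested: connect straight to the thrift server and compare column metadata
conn = hive.connect(host="127.0.0.1", port=10001)
cursor = conn.cursor()
cursor.execute("show tables in experiments")
print(cursor.description)  # does this show one column (tab_name) or three?
print(cursor.fetchall())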

drewbanin commented on July 22, 2024

I see! That's pretty surprising - I definitely would have expected only a single column to be returned.

My guess is that the results variable (which we expect to contain 3 columns) instead contains some sort of error response? I don't have an EMR Spark cluster with a thrift server running to test this against. If you're up to it, it would be awesome if you could:

  1. Add logging/a debugger to /home/paul/dbt-spark/dbt/adapters/spark/impl.py on Line 74
  2. Determine what the value of results is before dbt tries to iterate over it
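
Something as simple as a breakpoint right before dbt iterates over the results should do it. Sketch (the exact line may have drifted):

# in dbt/adapters/spark/impl.py, just before the loop over query results
import pdb; pdb.set_trace()
# then at the (Pdb) prompt, inspect e.g. results.column_names and results[0]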

Given that this is (I believe) the first query that dbt tries to run, an error response seems plausible. With this Thrift approach, just creating the connection does not necessarily validate that auth/config/etc. are working as intended (as it would with, say, Redshift), so the results variable may not actually contain records from the database. That's just a hunch though -- I'm really not sure what's happening!

If you're able to tell what's happening, we should definitely be able to write some code to handle that case better. Let me know!

pgr0ss commented on July 22, 2024

I'm a little swamped for the next few days but I will try to get back to this and help debug.

pgr0ss commented on July 22, 2024

I added a pdb breakpoint in the code and played around with results. There's no error, but there's only a single column:

(Pdb) results.column_names
('tab_name',)

(Pdb) results[0]
<agate.rows.Row object at 0x7fb71eec7ab0>

(Pdb) results[0].get("tab_name")
'mytable'

I tried changing the code to assume a single column for now, to see if I could get further:

diff --git a/dbt/adapters/spark/impl.py b/dbt/adapters/spark/impl.py
index e4364ad..f2da50c 100644
--- a/dbt/adapters/spark/impl.py
+++ b/dbt/adapters/spark/impl.py
@@ -71,7 +71,10 @@ class SparkAdapter(SQLAdapter):
             'schema': True,
             'identifier': True
         }
-        for _database, name, _ in results:
+        for name in results:
+            _database = "experiments"
+            name = name.get("tab_name")
             relations.append(self.Relation.create(
                 database=_database,
                 schema=_database,

But now I get this error:

$ dbt run
Running with dbt=0.13.0
Found 1 models, 0 tests, 0 archives, 0 analyses, 99 macros, 0 operations, 0 seed files, 0 sources

22:50:08 | Concurrency: 4 threads (target='dev')
22:50:08 |
22:50:08 | 1 of 1 START view model experiments.my_first_dbt_model............... [RUN]
22:50:08 | 1 of 1 ERROR creating view model experiments.my_first_dbt_model...... [ERROR in 0.61s]
22:50:08 |
22:50:08 | Finished running 1 view models in 1.83s.

Completed with 1 errors:

Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
  Error while compiling statement: FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.InvalidTableException: Table not found _dummy_table

Done. PASS=0 ERROR=1 SKIP=0 TOTAL=1

I'm using Python 3.6.

pgr0ss commented on July 22, 2024

In addition to the diff above, I also changed the example from a view to a table and it worked:

% cat models/example/my_first_dbt_model.sql

-- Welcome to your first dbt model!
-- Did you know that you can also configure models directly within
-- the SQL file? This will override configurations stated in dbt_project.yml

-- Try changing 'view' to 'table', then re-running dbt
{{ config(materialized='table') }}


select 1 as id
% dbt run         
Running with dbt=0.13.1
Found 1 models, 0 tests, 0 archives, 0 analyses, 99 macros, 0 operations, 0 seed files, 0 sources

21:15:29 | Concurrency: 4 threads (target='dev')
21:15:29 |
21:15:29 | 1 of 1 START table model experiments.my_first_dbt_model.............. [RUN]
21:15:51 | 1 of 1 OK created table model experiments.my_first_dbt_model......... [OK in 21.22s]
21:15:51 |
21:15:51 | Finished running 1 table models in 23.42s.

Completed successfully

Done. PASS=1 ERROR=0 SKIP=0 TOTAL=1

pgr0ss commented on July 22, 2024

I'm not sure if this is another difference with connecting via thrift over TCP, but the SQL it's generating doesn't seem to be valid.

Here's a model that chains off the example model:

% cat models/example/dbt_example_model_2.sql
{{ config(
  materialized='table',
  partition_by=['id'],
  file_format='parquet',
) }}

select id from {{ ref('my_first_dbt_model') }}

This generates:

create table experiments.dbt_example_model_2

    using parquet
    partitioned by (id)
    as


select id from experiments.my_first_dbt_model

which gives syntax errors. I manually fixed it by:

  1. Moving the partitioned by clause up
  2. Giving the partition column a type
  3. Changing using to stored as

The fixed version:

create table experiments.dbt_example_model_2

    partitioned by (id int)
    stored as parquet
    as


select id from experiments.my_first_dbt_model
;

This now parses but gives this error:

FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table (state=42000,code=10068)

It seems like you need to first create the table and then load into it?
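
If so, maybe something like this two-step version (untested sketch; the dummy val column is only there because a Hive table seems to need at least one non-partition column, and dynamic partitioning has to be enabled for the insert):

-- create first, then insert, since Hive CTAS can't partition the target table
create table experiments.dbt_example_model_2 (val int)
partitioned by (id int)
stored as parquet;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- partition columns go last in the select list
insert overwrite table experiments.dbt_example_model_2 partition (id)
select 1 as val, id from experiments.my_first_dbt_model;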

pgr0ss commented on July 22, 2024

Ok, I think the syntax does depend on how you connect to Spark. If I SSH to the cluster and run spark-sql, the generated SQL works fine. But if I run it through beeline [1], I need to use my transformed SQL. I wonder if the thrift connection (#20) is just not compatible with the currently generated SQL? In that case I need to figure out how to get the HTTP connection working.

[1] /usr/lib/spark/bin/beeline -u 'jdbc:hive2://localhost:10000/default'

rhousewright commented on July 22, 2024

Interesting. I think those syntax differences are, basically, Spark SQL vs. HiveQL (see https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html). Maybe there are some configuration options for thrift to change it into Spark SQL mode? I'm stabbing wildly in the dark; things I'm reading suggest that it should be more Spark-native (e.g. http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/), but apparently not always. If there's no way to change the mode, we may need to look into something like https://livy.incubator.apache.org instead of Thrift.

Are you using EMR? What release / Spark version?

pgr0ss commented on July 22, 2024

> I think that those syntax differences are, basically, Spark SQL vs HiveQL

I've come to the same conclusion, and I'm surprised they are so different.

> Maybe there's some configuration options for thrift to change it into SparkSQL mode?

It seems like the master branch of this plugin wants to connect to the Spark thrift server over HTTP. Maybe that's the native Spark mode? If I use the thrift branch (#20), then it seems to connect to the HiveQL server. Unfortunately, I'm not sure how to start the Spark thrift server properly so that the authentication code in here works with it.
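
Maybe something like this would start it in HTTP mode? (Untested; these are standard HiveServer2 settings that the Spark thrift server reuses -- the port and path here are guesses at what the plugin expects.)

sudo /usr/lib/spark/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.transport.mode=http \
  --hiveconf hive.server2.thrift.http.port=10001 \
  --hiveconf hive.server2.thrift.http.path=cliservice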

We use EMR 5.20 with Spark 2.4.0.

We actually already use Livy for other stuff. I was thinking about spiking out changing this plugin to use it, but that's a larger piece of work, and I'm not sure when I'll be able to find time for it.
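
For reference, Livy is just REST calls, so a spike might look roughly like this (hypothetical sketch, not how the plugin works today -- assumes Livy on its default port 8998 and a Livy version that supports the sql session kind):

import time
import requests

LIVY = "http://localhost:8998"  # hypothetical Livy endpoint

# create an interactive session that accepts Spark SQL statements
sid = requests.post(LIVY + "/sessions", json={"kind": "sql"}).json()["id"]

# wait for the session to come up
while requests.get("{}/sessions/{}".format(LIVY, sid)).json()["state"] != "idle":
    time.sleep(1)

# submit a statement, then poll it until the output is available
stmt = requests.post("{}/sessions/{}/statements".format(LIVY, sid),
                     json={"code": "show tables in experiments"}).json()
result = requests.get("{}/sessions/{}/statements/{}".format(LIVY, sid, stmt["id"])).json()
print(result)  # result["output"] holds the rows once result["state"] == "available"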

drewbanin commented on July 22, 2024

Thanks for the info and follow-ups! This all sounds reasonable to me. Really interesting that #20 connects right to the HiveQL server - that makes a lot of sense!

So, what do you guys think we should do here? I am very interested in making this Spark plugin work with Spark clusters hosted outside of Databricks (the environment we used when building this plugin). I'm not familiar with Livy, but I can do some more reading here.

Curious how you guys think we should proceed.

pgr0ss commented on July 22, 2024

We do have both Spark and Hive, and it does seem like Hive somehow takes over. I ran your command and I see this:

$ sudo /usr/lib/spark/sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /var/log/spark/spark-root-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-ip-10-49-18-147.out

rhousewright commented on July 22, 2024

Do you have the option of just not running Hive on your clusters? There might be some other under-the-covers settings that would affect what Thrift talks to, but that sounds like a bit of a rabbit hole.

rhousewright commented on July 22, 2024

Yes, absolutely - I'll update the docs in the pull request to make all of this clearer. It's fascinating that they're running on two ports!

drewbanin commented on July 22, 2024

Wow, thanks for the update @pgr0ss! @rhousewright lmk when you have time to revisit that PR -- happy to give it another look and get it merged 🤞

rhousewright commented on July 22, 2024

@drewbanin pushed some additional details re: EMR to the pull request. Let me know what you think!
