- dbt version 0.16.1
- Spark version 2.4.4
Hey folks,
I have encountered a problem where creating models/relations using dbt with EMR/Glue is limited: once enough metadata about the relations accumulates in Glue, the dbt run/query hangs.
First, I noticed that after creating 6 relations in my target schema, every subsequent dbt run hangs and stops advancing.
Looking at the Thrift logs on the master node, I noticed that the problem was the following query:
SparkExecuteStatementOperation: Running query '/* {"app": "dbt", "dbt_version": "0.16.1", "profile_name": "emr_datalake", "target_name": "dev", "connection_name": "list__dbt_spark"} */
show table extended in dbt_spark like '*'
' with 6adf8996-814e-4a8c-93a1-1caa85c13340
Looking at the dbt.log file, I noticed that the process stops at the TGetResultSetMetadataResp step:
2020-07-21 19:16:57.017962 (ThreadPoolExecutor-1_0): Acquiring new spark connection "list__dbt_spark".
2020-07-21 19:16:57.018235 (ThreadPoolExecutor-1_0): Re-using an available connection from the pool (formerly list_).
2020-07-21 19:16:57.019701 (ThreadPoolExecutor-1_0): NotImplemented: add_begin_query
2020-07-21 19:16:57.019795 (ThreadPoolExecutor-1_0): Using spark connection "list__dbt_spark".
2020-07-21 19:16:57.019868 (ThreadPoolExecutor-1_0): On list__dbt_spark: /* {"app": "dbt", "dbt_version": "0.16.1", "profile_name": "emr_datalake", "target_name": "dev", "connection_name": "list__dbt_spark"} */
show table extended in dbt_spark like '*'
2020-07-21 19:16:57.020019 (ThreadPoolExecutor-1_0): /* {"app": "dbt", "dbt_version": "0.16.1", "profile_name": "emr_datalake", "target_name": "dev", "connection_name": "list__dbt_spark"} */
show table extended in dbt_spark like '*'
2020-07-21 19:16:57.020118 (ThreadPoolExecutor-1_0): TExecuteStatementReq(sessionHandle=TSessionHandle(sessionId=THandleIdentifier(guid=b'iF\x80\x1fj\xcfB\xcf\x86o\xd4\xa4,\xa1\xdd/', secret=b'vR$j\x9cbGW\x8c\x07\x91\x84\xb0\xe8\xb6\x80')), statement='/* {"app": "dbt", "dbt_version": "0.16.1", "profile_name": "emr_datalake", "target_name": "dev", "connection_name": "list__dbt_spark"} */\nshow table extended in dbt_spark like \'*\'\n ', confOverlay=None, runAsync=True, queryTimeout=0)
2020-07-21 19:16:57.164044 (ThreadPoolExecutor-1_0): TExecuteStatementResp(status=TStatus(statusCode=0, infoMessages=None, sqlState=None, errorCode=None, errorMessage=None), operationHandle=TOperationHandle(operationId=THandleIdentifier(guid=b'\xdb\xad4\x04\x91\xfdMQ\xac\xd3\tN&\xd8DC', secret=b't\x80\xfa<\x1e\xf2@\x83\xae]\t\xd69\x1bq\xea'), operationType=0, hasResultSet=True, modifiedRowCount=None))
2020-07-21 19:16:59.199866 (ThreadPoolExecutor-1_0): TGetOperationStatusResp(status=TStatus(statusCode=0, infoMessages=None, sqlState=None, errorCode=None, errorMessage=None), operationState=2, sqlState=None, errorCode=None, errorMessage=None, taskStatus=None, operationStarted=None, operationCompleted=None, hasResultSet=None, progressUpdateResponse=None)
2020-07-21 19:16:59.200094 (ThreadPoolExecutor-1_0): Poll status: 2, query complete
2020-07-21 19:16:59.200222 (ThreadPoolExecutor-1_0): SQL status: OK in 2.18 seconds
2020-07-21 19:16:59.778318 (ThreadPoolExecutor-1_0): TGetResultSetMetadataResp(status=TStatus(statusCode=0, infoMessages=None, sqlState=None, errorCode=None, errorMessage=None), schema=TTableSchema(columns=[TColumnDesc(columnName='database', typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7, typeQualifiers=None), arrayEntry=None, mapEntry=None, structEntry=None, unionEntry=None, userDefinedTypeEntry=None)]), position=1, comment=''), TColumnDesc(columnName='tableName', typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7, typeQualifiers=None), arrayEntry=None, mapEntry=None, structEntry=None, unionEntry=None, userDefinedTypeEntry=None)]), position=2, comment=''), TColumnDesc(columnName='isTemporary', typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=0, typeQualifiers=None), arrayEntry=None, mapEntry=None, structEntry=None, unionEntry=None, userDefinedTypeEntry=None)]), position=3, comment=''), TColumnDesc(columnName='information', typeDesc=TTypeDesc(types=[TTypeEntry(primitiveEntry=TPrimitiveTypeEntry(type=7, typeQualifiers=None), arrayEntry=None, mapEntry=None, structEntry=None, unionEntry=None, userDefinedTypeEntry=None)]), position=4, comment='')]))
Since I am not that familiar with the steps I should expect, I removed the 6 previously created relations and created 7 dummy relations, in order to rule out the "maximum of 6 relations" hypothesis.
WITH dummy AS (
SELECT 1 AS col1,
1 AS col2,
1 AS col3,
1 AS col4,
1 AS col5,
1 AS col6,
1 AS col7,
1 AS col8,
1 AS col9,
1 AS col10,
1 AS col11,
1 AS col12
)
SELECT *
FROM dummy
I created 7 copies of this model:
Running with dbt=0.16.1
Found 37 models, 0 tests, 0 snapshots, 0 analyses, 143 macros, 0 operations, 0 seed files, 17 sources
21:57:25 | Concurrency: 1 threads (target='dev')
21:57:25 |
21:57:25 | 1 of 7 START view model .dbt_spark.src_dummy_1................ [RUN]
21:57:27 | 1 of 7 OK created view model .dbt_spark.src_dummy_1........... [OK in 2.34s]
21:57:27 | 2 of 7 START view model .dbt_spark.src_dummy_2................ [RUN]
21:57:29 | 2 of 7 OK created view model .dbt_spark.src_dummy_2........... [OK in 2.20s]
21:57:29 | 3 of 7 START view model .dbt_spark.src_dummy_3................ [RUN]
21:57:31 | 3 of 7 OK created view model .dbt_spark.src_dummy_3........... [OK in 2.16s]
21:57:31 | 4 of 7 START view model .dbt_spark.src_dummy_4................ [RUN]
21:57:34 | 4 of 7 OK created view model .dbt_spark.src_dummy_4........... [OK in 2.14s]
21:57:34 | 5 of 7 START view model .dbt_spark.src_dummy_5................ [RUN]
21:57:36 | 5 of 7 OK created view model .dbt_spark.src_dummy_5........... [OK in 2.05s]
21:57:36 | 6 of 7 START view model .dbt_spark.src_dummy_6................ [RUN]
21:57:38 | 6 of 7 OK created view model .dbt_spark.src_dummy_6........... [OK in 2.25s]
21:57:38 | 7 of 7 START view model .dbt_spark.src_dummy_7................ [RUN]
21:57:40 | 7 of 7 OK created view model .dbt_spark.src_dummy_7........... [OK in 2.17s]
21:57:42 |
21:57:42 | Finished running 7 view models in 22.47s.
Checking the Thrift logs, I noticed the step that should have come next had the query completed instead of hanging:
2020-07-21 19:20:50.359768 (ThreadPoolExecutor-1_0): TFetchResultsResp
This call/response is a large one, since it returns the full metadata for every relation in the target schema: essentially, what you are asking for through show table extended in <schema_name> like '*'.
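One way to pin down where the hang occurs (statement execution vs. result fetch) is to wrap the fetch in a timeout. Below is a minimal sketch of that idea against stand-in cursor objects; the cursor classes and timings are invented for illustration, and with a real connection you would pass a PyHive cursor instead.

```python
import concurrent.futures
import time

def fetch_with_timeout(cursor, sql, timeout_s):
    """Execute sql, then try to fetch the result set, giving up on the
    wait after timeout_s seconds so a hung fetch shows up as an error
    instead of an indefinite hang."""
    cursor.execute(sql)
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(cursor.fetchall).result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise RuntimeError(
            "execute() finished but fetchall() did not return within "
            f"{timeout_s}s -- the hang is in the result fetch")
    finally:
        pool.shutdown(wait=False)

# Stand-in cursors simulating a fast fetch and a hung fetch.
class FastCursor:
    def execute(self, sql):
        pass
    def fetchall(self):
        return [("dbt_spark", "src_dummy_1", False, "...")]

class HungCursor:
    def execute(self, sql):
        pass
    def fetchall(self):
        time.sleep(2)  # stands in for a fetch that never returns
        return []

rows = fetch_with_timeout(
    FastCursor(), "show table extended in dbt_spark like '*'", 1.0)
```

Against the real server, execute() succeeding while the fetch times out would match the Thrift logs: TExecuteStatementResp and TGetResultSetMetadataResp come back, but TFetchResultsResp never arrives.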
To confirm that the problem is the size of the metadata payload returned from Glue, I added more columns and text to a model.
WITH dummy AS (
SELECT 1 AS col1,
1 AS col2,
1 AS col3,
1 AS col4,
1 AS col5,
1 AS col6,
1 AS col7,
1 AS col8,
1 AS col9,
1 AS col10,
1 AS col11,
1 AS col12,
1 AS col13,
1 AS col14,
1 AS col15,
1 AS col16,
1 AS col17,
1 AS col18,
1 AS col19,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col20,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col21,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col22,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col23,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col24,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col25,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col26,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col27,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col28,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col29,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col30,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col31,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col32,
'TEST123TEST123TEST123TEST123TEST123TEST123' AS col33
)
SELECT *
FROM dummy
This run actually executes, since the metadata about the new relations has not yet been fetched (that happens on the next dbt run).
Running the same set of models again results in a hang, because the metadata for the new models is now too large (suggesting there is a cap on the amount of data that can be returned/parsed).
Running with dbt=0.16.1
Found 37 models, 0 tests, 0 snapshots, 0 analyses, 143 macros, 0 operations, 0 seed files, 17 sources
22:03:56 | Concurrency: 1 threads (target='dev')
22:03:56 |
22:03:56 | 1 of 7 START view model .dbt_spark.src_dummy_1................ [RUN]
With the wider models, the dbt run hangs until I kill it manually.
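As a rough illustration of why the wider models tip things over: the information column returned by show table extended carries, among other things, one entry per column of each relation, so the payload grows with both the number of relations and their width. This back-of-envelope sketch only shows that growth; the byte counts are invented, not Glue's actual serialization format.

```python
# Back-of-envelope estimate of the "show table extended" result size.
# All per-entry sizes here are assumptions for illustration; the real
# "information" field also carries location, provider, view text, etc.

def estimated_payload_bytes(n_relations, cols_per_relation,
                            avg_col_entry_bytes=30, fixed_overhead=1500):
    # one schema entry per column, plus fixed per-relation metadata
    per_relation = fixed_overhead + cols_per_relation * avg_col_entry_bytes
    return n_relations * per_relation

# First set of dummy models: 7 relations x 12 narrow columns.
narrow = estimated_payload_bytes(7, 12)
# Wider models: 7 relations x 33 columns, many with long string defaults.
wide = estimated_payload_bytes(7, 33, avg_col_entry_bytes=60)

print(narrow, wide)  # the wide set produces a noticeably larger payload
```

Whatever the exact cap is on the Thrift/Glue side, the payload scales linearly with relations times columns, which is consistent with the narrow set succeeding and the wide set hanging.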