Comments (19)
it might be a trino question than a dbt question, @findinpath, do you know if this kind of transaction works in trino ?
(for example seeding sequence looks like this):
transaction begin
insert 1
insert 2
insert 3
commit or rollback
as far as we've tested, if insert 3 fails, the result of insert 1 and 2 still in the seed table.
Maybe the transaction start/commit in dbt-presto/trino is just a place-holder or remains of a copy from other drivers. (btw, with snowflake, the rollback in previous sample seems to clear up the result of all inserts, as expected)
from dbt-trino.
as far as we've tested, if insert 3 fails, the result of insert 1 and 2 still in the seed table.
Indeed, Trino doesn't support fully ACID semantics.
Trino makes sure that ONE INSERT
statement doesn't create corrupt data by inserting data in one or more staging table(s) and then doing on the target database INSERT INTO target_table SELECT * FROM staging_table(s)
.
See https://www.trinoforum.org/t/insert-performance-on-trino-jdbc-connectors/99/3 for a more detailed explanation about the way that INSERT
works on Trino.
from dbt-trino.
@hashhar , thanks for confirmation. Will file a issue for this.
And yes, that is what we are facing
If I understand correctly any query ran within a transaction that takes > transaction timeout will exhibit this issue?
from dbt-trino.
dbt
is responsible for performing the changes in a transactional fashion.
If you think about it, this is meaningful in order to avoid having corrupt data after running dbt .
@aakashnand the PR trinodb/trino#9846 may be linked to your issue. It will be available only from Trino version 365
.
from dbt-trino.
this is meaningful in order to avoid having corrupt data after running dbt
@findinpath even when we are connecting using AUTOCOMMIT?
dbt-trino/dbt/adapters/trino/connections.py
Line 195 in 8088447
My understanding here is that when we use AUTOCOMMIT, every query is committed as soon as it is finished. In this sense, even if we execute 1000s of queries the end result will not lead to data corruption.
Also, what is the correct solution to solve this issue?
from dbt-trino.
dbt can execute more than one statement corresponding to a materialization.
Maybe I am wrong, please double check this on dbt slack, but this is the reason why dbt works with transactions.
from dbt-trino.
@findinpath Do you have any solution in mind for this or should we discuss this on slack channel #dbt-presto-trino?
The temporary solution we have is to increase transaction.idle-timeout=8h
in config.properties
from dbt-trino.
@aakashnand I have no better solution than what you just posted.
A slightly different approach is to set the sesssion properties on dbt
and affect therefore only the dbt project and not all Trino clients with the transaction idle timeout setting.
my-trino-db:
target: dev
outputs:
dev:
type: trino
user: commander
host: 127.0.0.1
port: 8080
database: analytics
schema: public
threads: 8
http_scheme: http
session_properties:
....
from dbt-trino.
I believe transaction.idle-timeout
cannot be set as a session property.
from dbt-trino.
As suggested by @hovaesco this property cannot be set using session property so we need to set it in the config.properties 🙁
from dbt-trino.
we might have understood what happens but want to find more evidences.
btw @findinpath , could you run this example ?
https://github.com/trinodb/trino-python-client#transactions
somehow it did not work in our env with TWO inserts. ONE insert is fine.
just want to make sure our env. is setup correctly.
from dbt-trino.
I think the reason the transaction does not work is on our side. Anyway, we think the reason the time-out occurred at the 1st place might be this:
- a. https://github.com/trinodb/trino-python-client/blob/master/trino/dbapi.py#L173
- b. https://github.com/trinodb/trino-python-client/blob/c8144dec2d6d44937a7aa1ba295d82b286b292c6/trino/dbapi.py#L130
dbt makes the connection with IsolationLevel.AUTOCOMMIT as default. So when the sql is executed, it creates a request (#a). And when the start_transaction is execute, #b create the transaction in another request instance.
The query and the transaction happens independently. This answers the question why we had the error at the COMMIT but the query has been succeed. And also a simple test with start_transaction -> wait long enough -> commit yields the same error.
We need more test to see how to improve this but there might be 2 options now:
- use a different isolation level when dbt-trino/presto initializes the connection
- change the way trino client implements #a to care about current transaction instance, not only isolation level
from dbt-trino.
@hashhar / @ebyhr can you please take a look at the comment #20 (comment) and help out with feedback?
from dbt-trino.
@bachng2017 Good catch. that indeed looks like a bug in how transactions are handled in the Trino python client.
Can you file the issue against the Python client with a simple reproduction code so that we can verify that the fix that gets implemented actually works?
If I understand correctly any query ran within a transaction that takes > transaction timeout will exhibit this issue?
from dbt-trino.
@hashhar created an issue for this at trino-python-client, pls check
trinodb/trino-python-client#129
from dbt-trino.
PR to address the issue: #30
from dbt-trino.
Merged #30
from dbt-trino.
hello @hovaesco
I would like to confirm about the #30 merge (sorry that I should contact earlier but did not have time)
the fix seems to eliminate (pass) all start_transaction/commit/rollback in the old codes with this reason
dbt-trino adapter uses connection in auto-commit mode, hence there is no need to separately start a new transaction in each query since it makes no effect. In auto-commit mode, the client will create a new transaction for each query.
with the old codes, the typical sequence happens like this:
start transaction -> create tmp table -> create real table -> commit
in new codes, it just
create tmp table -> create real table
so, autocommit
for each query might not be sufficient . What do you think ?
the affect is not too big but the concept is very different from the old codes.
from dbt-trino.
Could you please elaborate more on autocommit for each query might not be sufficient
?
Each query is executed with autocommit
so each operation is committed immediately after performing it.
from dbt-trino.
Related Issues (20)
- Include query_id in run results for seed HOT 1
- Invalid SQL generated for dbt_project_evaluator HOT 1
- dbt-trino adds comment into table create statement by default HOT 2
- dbt-Trino snapshot cannot create __dbt_tmp table after first run HOT 12
- Add tests for aliases
- Add copyright notices to files HOT 1
- Support elementary in improving data-quality capabilities HOT 3
- Extend Hive test coverage
- upgrade to support dbt-core v1.7.0 HOT 1
- Incorrect Schema Used When Renaming Materialized Views HOT 4
- Support CASCADE dropping relations
- Solving for large stage depths HOT 3
- Support `CREATE OR REPLACE` HOT 2
- get_relation not working as already_exists HOT 4
- deltalake rename managed table not allowed arised HOT 1
- Failed to connect to Trino cluster using LDAP auth and HTTP connection HOT 3
- Extra credentials in connection HOT 3
- Not able to create snapshot of a model using dbt-trino==1.7.1 HOT 3
- Refactor to use dbt-adapters interface layer HOT 2
- merge_exclude_columns doesn't work
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dbt-trino.