Comments (5)
It needs to scan the entire table if you don't use partitioning. If you do partition, you need to give an explicit partition predicate to reduce the number of partitions you read.
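As a rough illustration of what an explicit partition predicate can look like with the Python bindings, here is a minimal sketch. The column names `id` and `date`, the aliases, and both helper functions are hypothetical; it assumes the table is partitioned by `date` and that `dt` is a `deltalake.DeltaTable`.

```python
def partition_scoped_predicate(
    join_condition: str,
    partition_column: str,
    partition_value: str,
    target_alias: str = "target",
) -> str:
    """Append an equality filter on the target's partition column to a merge condition."""
    # Escape single quotes so the value stays a valid SQL string literal.
    escaped = partition_value.replace("'", "''")
    return f"{join_condition} AND {target_alias}.{partition_column} = '{escaped}'"


def merge_into_partition(dt, source, partition_value: str):
    """Upsert `source` into `dt`, restricting the target scan to one partition.

    `dt` is assumed to be a deltalake.DeltaTable and `source` a pyarrow Table
    whose rows all belong to `partition_value` (hypothetical setup).
    """
    predicate = partition_scoped_predicate(
        "target.id = source.id", "date", partition_value
    )
    return (
        dt.merge(
            source=source,
            predicate=predicate,
            source_alias="source",
            target_alias="target",
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )


print(partition_scoped_predicate("target.id = source.id", "date", "2024-01-01"))
```

Whether the extra condition actually prunes partitions depends on the delta-rs version and the shape of the predicate, so it is worth checking the merge metrics after a run.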
from delta-rs.
Thanks for the reply, @ion-elgreco. Why does it need to scan the entire table into memory before it starts writing data? Is this just a lack of optimization, or is there something fundamental to what merge is doing that prevents this kind of optimization?
from delta-rs.
It needs to scan the entire table because it needs to find out which rows the merge into condition applies to.
from delta-rs.
I understand this. However, it doesn't need to hold the entire table in memory while it is performing the merge. It could do this in a streaming fashion, which is more or less what you get out of the box with DataFusion.
from delta-rs.
To answer my own questions here:
"Why does it need to scan the entire table into memory before it starts writing data?"
`MergeBarrier` holds on to all records for a particular file until either a delete, update, or insert is encountered, or until the input data is fully exhausted. This means that in workloads with a large input set, where the merge does not typically have deletes, updates, or inserts, the entire dataset will usually be buffered in memory.
This could be avoided if we somehow explicitly told DataFusion to fully exhaust one file at a time so that data could be flushed.
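To make the buffering argument concrete, here is a toy model in plain Python. It is not the delta-rs `MergeBarrier` code: the function `peak_buffered_rows`, the `file_sorted` flag, and the `(file_id, op)` row shape are all made up for illustration, and it assumes that a file with no changes does not need to be rewritten, so its held rows can simply be released.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple

Row = Tuple[str, Optional[str]]  # (file_id, op); op is None for an unchanged row


def peak_buffered_rows(rows: Iterable[Row], file_sorted: bool = False) -> int:
    """Return the largest number of rows held back at any one time.

    Rows of a file are held until a delete/update/insert is seen for that
    file. With file_sorted=True the stream is assumed to exhaust one file
    before starting the next, so a finished file's rows can be released.
    """
    held = defaultdict(int)  # file_id -> rows currently held back
    changed = set()          # files with a change seen; rows pass straight through
    current = None
    peak = 0
    for file_id, op in rows:
        if file_sorted and current is not None and file_id != current:
            # Previous file is fully read and saw no change: release its rows.
            held.pop(current, None)
        current = file_id
        if file_id in changed:
            continue  # already known to be rewritten, nothing to hold
        if op is not None:
            changed.add(file_id)
            held.pop(file_id, None)  # flush everything held for this file
            continue
        held[file_id] += 1
        peak = max(peak, sum(held.values()))
    return peak


# Three files with four unchanged rows each.
interleaved = [(f, None) for _ in range(4) for f in "abc"]
one_file_at_a_time = [(f, None) for f in "abc" for _ in range(4)]

print(peak_buffered_rows(interleaved))                          # 12
print(peak_buffered_rows(one_file_at_a_time, file_sorted=True)) # 4
```

Without the ordering guarantee, every untouched row stays held until the input ends. With it, the peak drops to the size of the largest single file.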
I can imagine using partitioning to break the merge into many operations so that all the data is not pulled into memory at once. But in my case, I'd probably rather put the effort towards not using Merge.
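For anyone who does want to try the partition-by-partition route, a minimal sketch of the splitting step could look like this. The helpers `group_by_partition` and `merge_per_partition` and the `merge_one` callback are hypothetical; in practice `merge_one` would wrap a real merge call with a predicate restricted to that partition.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def group_by_partition(rows: Iterable[Row], partition_column: str) -> Dict[Any, List[Row]]:
    """Bucket source rows by the value of their partition column."""
    groups: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[partition_column]].append(row)
    return dict(groups)


def merge_per_partition(
    rows: Iterable[Row],
    partition_column: str,
    merge_one: Callable[[Any, List[Row]], Any],
) -> List[Any]:
    """Run one merge per partition value, so only one partition is in flight at a time."""
    groups = group_by_partition(rows, partition_column)
    results = []
    for value in sorted(groups):
        # merge_one is expected to run a merge limited to this partition value.
        results.append(merge_one(value, groups[value]))
    return results


source_rows = [
    {"id": 1, "date": "2024-01-02"},
    {"id": 2, "date": "2024-01-01"},
    {"id": 3, "date": "2024-01-02"},
]

# Stand-in callback that only reports what it would merge.
report = merge_per_partition(source_rows, "date", lambda value, batch: (value, len(batch)))
print(report)
```

One trade-off to keep in mind: each per-partition merge is its own commit, so the overall operation is no longer atomic across partitions.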
from delta-rs.