Comments (7)
Parquet was used at some point in the past; looking through the git history will take you there. Whatever tools can use in-memory data are already using it, since that is simply faster than reading from disk, and at that point it doesn't matter whether the file is Parquet or CSV. Reading data from disk into memory is not part of the timings; it is always done at the start of the script. Parquet turned out not to be as portable as advertised, so CSV was kept instead. If you want to benchmark tools that run queries on on-disk data, then yes, it makes sense to look into Parquet again. But it only makes sense to use it for the on-disk cases (data too big for memory, or a solution that does not support an in-memory model), not to replace the in-memory model.
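For illustration, a minimal sketch (file name hypothetical) of how the scripts separate the untimed load from the timed query, using the pandas query quoted later in this thread:

import time
import pandas as pd

x = pd.read_csv("G1_1e9_1e2_0_0.csv")  # disk-to-memory load at script start; not timed

t0 = time.time()
ans = x.groupby("id1", dropna=False, observed=True).agg({"v1": "sum"})  # only the in-memory query is timed
print(time.time() - t0)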
from db-benchmark.
Actually, Parquet may still be in use for some tools (possibly only for 1e9 rows). In the README's batch benchmark run instructions you can read:
re-save to binary format where needed
from db-benchmark.
Yea, I have argued for this shift in the past. Even better would be 50 1GB Parquet files.
The 50GB benchmarks on a single CSV file are really misleading, especially for engines that are optimized to perform parallel reads.
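As a sketch of how the data could be re-saved that way (assuming pyarrow; dataset names hypothetical), splitting 1e9 rows into roughly 50 files:

import pyarrow.dataset as ds

# Read the single large CSV lazily and rewrite it as many smaller Parquet
# files, so engines can parallelize the scan across files.
table = ds.dataset("G1_1e9_1e2_0_0.csv", format="csv")
ds.write_dataset(table, "G1_1e9_1e2_0_0_parquet", format="parquet",
                 max_rows_per_file=20_000_000)  # 1e9 rows / 2e7 = 50 files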
from db-benchmark.
The benchmarks are not about CSV or Parquet; they are about in-memory data. None of the solutions in the benchmark uses CSV as its data model. You may want to reread my previous messages.
from db-benchmark.
@jangorecki - yea, I agree, but CSV files have limitations that cause memory issues on queries that wouldn't have issues if the data were stored in Parquet. Let's look at an example pandas query:
x.groupby("id1", dropna=False, observed=True).agg({"v1": "sum"})
For the 1 billion row dataset, this query will error out on some machines if CSV is used, but will work if Parquet is used. That means the selected file format is changing the benchmark results for in-memory data.
The file format is impacting the distributed compute engine results even more.
Furthermore, the benchmarks run way, way slower than they should because of the CSV file format. When pandas read_csv is used on a large file, usecols should definitely be set, but using read_parquet and setting the columns argument would be way better. This result is totally misleading IMO.
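For concreteness, a minimal sketch of the two read paths (file names hypothetical), each requesting only the columns the query needs:

import pandas as pd

# CSV is row-oriented: the whole file is still scanned even with usecols
x = pd.read_csv("G1_1e9_1e2_0_0.csv", usecols=["id1", "v1"])

# Parquet is columnar: unneeded columns are never read from disk
x = pd.read_parquet("G1_1e9_1e2_0_0.parquet", columns=["id1", "v1"])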
from db-benchmark.
"query will error out on some machines if CSV is used"
More precisely: "if in-memory data is used".
Totally agree on that: whenever a tool is not able to work in-memory, falling back to on-disk data is an option. This is, for example, how Spark now does the join for 1e9 rows:
db-benchmark/spark/join-spark.py (line 48 in 00c4fdd)
It can easily be adapted for pandas from there.
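A minimal sketch of that kind of fallback (an assumption about what the referenced line amounts to, letting the persisted table spill to disk, rather than the exact code; file name hypothetical):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-1e9").getOrCreate()

# MEMORY_AND_DISK lets partitions that do not fit in memory spill to
# local disk instead of failing the query.
x = spark.read.csv("J1_1e9_NA_0_0.csv", header=True, inferSchema=True) \
    .persist(StorageLevel.MEMORY_AND_DISK)
x.count()  # materialize before the timed join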
Moreover, if it is faster for a specific solution to use on-disk rather than in-memory data, then on-disk should be preferred as well. But I think that is a rather uncommon scenario, because whatever on-disk format is used, it has to be loaded into memory anyway for the computation. It should also be well investigated whether that is a general rule, and not something that holds only under certain conditions (like a super fast disk).
from db-benchmark.
I think it would also make sense to include a (maybe separate) benchmark for how fast these engines can query a Parquet file (or perhaps a hive-partitioned directory of Parquet files).
I think that reflects real-world use cases pretty well, though I agree that once you include too many knobs, it's hard to have a fair and representative benchmarking setup.
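As a sketch of what such a query could look like (assuming DuckDB's Python API; directory layout hypothetical):

import duckdb

# Scan a hive-partitioned directory of Parquet files directly, with no
# separate load step; partition columns are derived from the paths.
ans = duckdb.sql("""
    SELECT id1, sum(v1) AS v1
    FROM read_parquet('data/**/*.parquet', hive_partitioning = true)
    GROUP BY id1
""").df()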
from db-benchmark.
Related Issues (20)
- Question HOT 2
- Publish DataFusion to report HOT 11
- Publish databend results HOT 1
- misplaced datafussion code in history.Rmd
- `=~` matching in run.sh causes "duckdb-latest" to match "duckdb" HOT 2
- duckdb-latest fails HOT 7
- datafusion does not correctly make chunk results HOT 3
- Inconsistent multicore use across languages? HOT 4
- Dask tests could use optimization HOT 3
- many solutions are missing from join 1e9 HOT 5
- Add one Digit to Benchmark Summary HOT 1
- Add cuDF HOT 4
- Published duckdb results are not reproducible HOT 2
- suspicious timings for groupby q10 HOT 7
- DOC: modin results "see README" HOT 1
- I created a benchmark but DuckDB run times are super slow and not sure why HOT 6
- Fix polars join script
- Add a new backend - Clojure's tablecloth
- Add set key to data.table operations.