
Comments (6)

jangorecki commented on August 23, 2024

Fact-to-dimension joins are of a quite different degree of complexity than big-to-big joins, where the join column (even after a distinct) does not fit into memory and the algorithm has to do the merge in batches and then reduce the batches to a single result. I am not sure the join task stresses that well enough, but that is the aim of q5.
Moreover, users don't query CSV data; they load it once and then query the loaded data multiple times, so tasks designed around reading a CSV plus a single query are rather uncommon.
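
To make the batch-and-reduce idea concrete, a minimal Python sketch might look like the following; the file names, the join key, and the bucket count are hypothetical, and a real engine would spill each bucket to a temporary file rather than hold it in memory:

import csv
from collections import defaultdict

N_PARTS = 8  # chosen so that one bucket of each side fits in memory

def partition(path, key, n_parts):
    # Split the rows of a CSV into buckets by hash of the join key.
    parts = [[] for _ in range(n_parts)]
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            parts[hash(row[key]) % n_parts].append(row)
    return parts

def batched_join(left_path, right_path, key):
    left_parts = partition(left_path, key, N_PARTS)
    right_parts = partition(right_path, key, N_PARTS)
    out = []
    # Matching keys always hash to the same bucket, so each bucket pair
    # can be joined independently and the partial results concatenated.
    for lp, rp in zip(left_parts, right_parts):
        index = defaultdict(list)
        for row in rp:
            index[row[key]].append(row)
        for row in lp:
            for match in index[row[key]]:
                out.append({**row, **match})
    return out

# joined = batched_join("big_left.csv", "big_right.csv", "id")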


hkpeaks commented on August 23, 2024

For a batch query job, users will obviously configure it to output as many reports as possible; that is not a problem. Normally they will request a web interface rather than a Windows/Linux CLI, e.g. https://youtu.be/6hwbQmTXzMc

For example, for a retail business group there is a massive volume of invoices; the fact table contains item code, quantity, and so on, but it does not have the unit price or the exchange rate. When an accountant needs to prepare ad-hoc reports based on a specific condition, e.g. select the transactions that represent 80% of sales revenue, the process must include joining the fact table with the unit price and exchange rate tables in order to get the total amount for each invoice. I worked in accountancy for over two decades, so I understand how painful it is for accountants to deal with data: every working day is spent on routine and ad-hoc reports. Now I work as a programmer, so I have solved the performance issues of a full join, even when it runs on a <=32 GB desktop PC with billions of rows.
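
A rough sketch of that ad-hoc report in Polars might look like the following; the file and column names are made up, and it assumes a recent Polars where the cumulative-sum expression is cum_sum:

import polars as pl

fact = pl.scan_csv("fact.csv")          # ItemCode, Currency, Quantity, ...
prices = pl.scan_csv("unit_price.csv")  # ItemCode, Unit_Price
fx = pl.scan_csv("fx_rate.csv")         # Currency, Rate

report = (
    fact.join(prices, on="ItemCode")
        .join(fx, on="Currency")
        .with_columns((pl.col("Quantity") * pl.col("Unit_Price") * pl.col("Rate")).alias("Amount"))
        .sort("Amount", descending=True)
        .with_columns((pl.col("Amount").cum_sum() / pl.col("Amount").sum()).alias("CumShare"))
        .filter(pl.col("CumShare") <= 0.8)  # largest transactions covering ~80% of revenue
)

# print(report.collect())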


Tmonster commented on August 23, 2024

Hi @hkpeaks, Thank you for the suggestions.

If you have any questions about the scripts to run the benchmark, feel free to ask them with reproducible steps. As far as I know, the script in _utils/repro.sh should install the necessary tools to run this benchmark. You will, however, need to install each solution yourself following the {solution}/install.sh scripts provided.

As for every developer writing their own script, developers are encouraged to modify their groupby-{solution} and join-{solution} scripts to get the best performance.

For the Join Table test, yes, it is interesting, but as Jan said, big-to-big joins can be complex and stress the memory limits of a system, and it seems like every solution has trouble joining 50 GB files.
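
(As an aside, one way to probe that case with DuckDB is to cap its memory and give it a spill directory, so a big-to-big join runs out of core instead of failing. memory_limit and temp_directory are standard DuckDB settings; the file and column names below are only placeholders.)

import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '8GB'")          # cap RAM used by the join
con.execute("SET temp_directory = 'duck_tmp'")   # where spilled data goes

con.execute("""
copy (
    SELECT l.id, l.v1, r.v2
    FROM read_csv_auto('left_50gb.csv')  AS l
    JOIN read_csv_auto('right_50gb.csv') AS r ON l.id = r.id
) to 'join_result.csv' (format csv, header true);
""")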

If you would like to open a PR to add the Peaks solution to the repo, I would be happy to review it. You can use the work done in #12 and #11 as a template.


hkpeaks commented on August 23, 2024

The H2O script is very complex for me and for users, so I prefer to write a simple script for benchmarking. For the coming benchmark I plan to compare 3,000 files (300 million rows in total) across Polars, DuckDB, Peaks, and R data.table. But I am concerned about whether the scripts I write for third-party software are optimized for performance, so could you help review the DuckDB script below and check whether it is optimal?

***** Visitors of this page can recommend more software to be included in the benchmark, provided that they supply the relevant scripts. *****

Test data: https://github.com/hkpeaks/peaks-consolidation/blob/main/Input/1.csv

  • My current test for DuckDB is only 22 seconds.

  • Spark fails to run properly on my Windows 11; I suspect the Java VM conflicts with Windows 11, and it is not reasonable that the app requires opening a port on my machine (a high risk of attracting hackers). So I have removed Spark from the comparison and would instead ask experienced Databricks users to test it: how many nodes would it need to beat the number one in this benchmark?

  • Polars has a pending bug fix: pola-rs/polars#9201.
    Noticed it is now fixed in pola-rs/polars#9251.

  • It seems data.table does not support larger-than-memory data, because my current test triggers an out-of-memory error: "Error: cannot allocate vector of size 1.1 Gb Execution halted".

  • I am considering including Tableau Hyper, but I am concerned that it does not support scripting and requires clicking each function in the GUI.

  • Pandas does not support streaming: pandas-dev/pandas#53249

  • My Peaks is extending its streaming engine to cover the many-files scenario (it currently supports a single billion-row file), and I also need to add a new composite function, Filter2GroupBy, which can save substantial memory and CPU resources.

******** DuckDB ********

import duckdb
import time

s = time.time()
con = duckdb.connect()

con.execute("""
copy (
    SELECT Ledger, Account, DC, Currency, SUM(Base_Amount) AS Total_Base_Amount
    FROM read_csv_auto('input/3000Files/*.csv')
    WHERE Ledger >= 'L30' AND Ledger <= 'L70'
    GROUP BY Ledger, Account, DC, Currency
) to 'output/DuckFilterGroupByCSV.csv' (format csv, header true);
""")

e = time.time()
print("DuckDB FilterGroupBy CSV Time = {}".format(round(e - s, 3)))

******** Polars ********

import polars as pl
import time

s = time.time()

table1 = (
    pl.scan_csv("Input/3000Files/*.csv")
      .filter((pl.col("Ledger") >= "L30") & (pl.col("Ledger") <= "L70"))
      .groupby(by=["Ledger", "Account", "DC", "Currency"])
      .agg([
          pl.sum("Base_Amount").alias("Total_Base_Amount"),
      ])
)

path = "Output/PolarsFilterGroupByCSV.csv"
table1.collect(streaming=True).write_csv(path)

e = time.time()
print("Polars FilterGroupBy CSV Time = {}".format(round(e - s, 3)))

******** Data.Table ********

library(data.table)

s <- Sys.time()
setDTthreads(10)

temp <- list.files("D:/Benchmark/Input/3000Files", full.names = TRUE, pattern = "\\.csv$")
DT <- rbindlist(lapply(temp, fread), fill = TRUE)
DT <- DT[Ledger >= 'L30' & Ledger <= 'L70',
         .(Total_Base_Amount = sum(Base_Amount)),
         by = .(Ledger, Account, DC, Currency)]
fwrite(DT, 'output/DataTableGroupByCSV.csv')

e <- Sys.time()
print(paste0("R-data.table FilterGroupBy CSV Time = ",
             round(as.numeric(difftime(e, s, units = "secs")), 3)))

******** Peaks Consolidation ********

Select{D:/Benchmark/Input/3000Files/*.csv | Ledger(L30..L70)}
GroupBy{Ledger, Account, DC,Currency => Sum(Base_Amount) ~ PeaksResult.csv}

In fact I don't understand why so much open source software requires users to fill in excessive settings to do a simple task.
My script design only asks users to fill in the essential information. Peaks may not be able to take the number one spot in this benchmark, but its script is a simple one. DuckDB recording 22 seconds for 3,000 files with 300 million rows in total is very challenging for me.


Tmonster commented on August 23, 2024

DuckDB recording 22 seconds for 3,000 files with 300 million rows in total is very challenging for me.

I'm not sure what you want me to do here. I can say that the duckdb code looks fine, and I am happy to hear that it finishes in a timely manner.

The scope of issues here is db-benchmark issues/feature requests/additional solutions etc. It seems like you would like to add peaks to the benchmark, and potentially add more benchmarking categories.

I don't plan on writing the infrastructure to add Peaks to the benchmark, as the project is not popular enough yet for me to invest the time and effort.
For the additional benchmarks, similar to what Jan Gorecki said, the requested benchmark questions aren't common workloads, so I won't be investing the time to add them to the benchmark.

If there's anything else I can help you with that is an issue or feature request, let me know; otherwise I will close this issue as not planned.


hkpeaks commented on August 23, 2024

I have tested the latest version of Polars, 0.19.1, which can process a 10-billion-row join with an input file size of 231 GB (output 389 GB). Please consider using the sink_csv function for your next benchmarking exercise.

import polars as pl
from datetime import datetime

start_time = datetime.now()

master = pl.scan_csv("Inbox/Master.csv")
fact_table = pl.scan_csv("Inbox/10000M-Fact.csv")

result = (
    fact_table.join(master, on=["Product", "Style"], how="inner")
              .with_columns((pl.col("Quantity") * pl.col("Unit_Price")).alias("Amount"))
)

result.sink_csv("Outbox/PolarsJoinResult.csv")

elapsed = datetime.now() - start_time
print(f"\nPolars InnerJoin Duration (in second): {elapsed.total_seconds():.3f}")

