
Comments (7)

qoega commented on August 23, 2024

I will try, ok.
From my understanding, the data still fits into the file cache after it is inserted, so there should be no difference in the level of data caching.
Previous queries do not leave any statistics, and we do not compute any per-column statistics yet. I would love to have this feature to better optimize query execution, but we are not there yet.

qoega commented on August 23, 2024

As I understand it, you are talking about both the sorted and unsorted results.

For ClickHouse, no additional indexes are built.

In the unsorted case, data is stored in insertion order.

LowCardinality columns allow aggregating/comparing values as integers and then replacing them with the value from a hash table. The LC mapping is not global; it exists only at the block level, so I do not consider it an index, just a specific data codec. These mappings still need to be merged between blocks.
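To make this concrete, here is a minimal sketch (the table names are made up for illustration); the query text is identical for both declarations, only the column encoding differs:

-- Hypothetical illustration: LowCardinality is a column encoding, not an index.
-- Within each block the column stores a dictionary of distinct strings plus small
-- integer positions, so comparisons and GROUP BY keys operate on the integers.
CREATE TABLE plain_t (id1 Nullable(String)) ENGINE = MergeTree() ORDER BY tuple();
CREATE TABLE lc_t (id1 LowCardinality(Nullable(String))) ENGINE = MergeTree() ORDER BY tuple();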

To reproduce locally on a tiny server, I just use this part without "insert into ans", since EXPLAIN would not be shown for the insert query. No indices are used:

CREATE TABLE G1_1e9_1e2_0_0 (id1 LowCardinality(Nullable(String)), id2 LowCardinality(Nullable(String)), id3 Nullable(String), id4 Nullable(Int32), id5 Nullable(Int32), id6 Nullable(Int32), v1 Nullable(Int32), v2 Nullable(Int32), v3 Nullable(Float64)) ENGINE = MergeTree() ORDER BY tuple();
INSERT INTO G1_1e9_1e2_0_0 SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/h2o/G1_1e9_1e2_0_0.csv.gz');

EXPLAIN indexes=1 SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count() AS cnt FROM G1_1e9_1e2_0_0 GROUP BY id1, id2, id3, id4, id5, id6 FORMAT Null;

┌─explain──────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))      │
│   Aggregating                                    │
│     Expression (Before GROUP BY)                 │
│       ReadFromMergeTree (default.G1_1e9_1e2_0_0) │
└──────────────────────────────────────────────────┘

EXPLAIN PIPELINE SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count() AS cnt FROM G1_1e9_1e2_0_0 GROUP BY id1, id2, id3, id4, id5, id6

┌─explain──────────────────────────────────────────────────────────────────────────┐
│ (Expression)                                                                      │
│ ExpressionTransform × 8                                                           │
│   (Aggregating)                                                                   │
│   Resize 8 → 8                                                                    │
│     AggregatingTransform × 8                                                      │
│       StrictResize 8 → 8                                                          │
│         (Expression)                                                              │
│         ExpressionTransform × 8                                                   │
│           (ReadFromMergeTree)                                                     │
│           MergeTreeSelect(pool: PrefetchedReadPool, algorithm: Thread) × 8 0 → 1  │
└───────────────────────────────────────────────────────────────────────────────────┘

So I think the result on a 150-core server is just that fast...

jangorecki commented on August 23, 2024

I don't mean sorted/unsorted; I mean collecting statistics about the data and reusing those statistics in later queries.

qoega commented on August 23, 2024

I understand that. When the data is sorted, there is some notion of a primary index, but it is not used in this query to filter anything up front. When the data is unsorted, there is no index even on the primary key. In both scenarios we do a full scan of the necessary columns and aggregate them explicitly. There is no per-key count/sum index or anything like that, so I expect the answer is "there is no precomputed data that is used to optimize this query".
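For contrast, a precomputed per-key aggregate in ClickHouse would have to be created explicitly, for example as a materialized view. The following is only a hypothetical sketch (names and settings are illustrative); nothing like it exists in the benchmark setup:

-- Hypothetical sketch, not part of the benchmark: a SummingMergeTree materialized
-- view would maintain per-key partial sums/counts incrementally on insert.
SET allow_nullable_key = 1;  -- the grouping columns are Nullable in this schema
CREATE MATERIALIZED VIEW G1_preagg
ENGINE = SummingMergeTree()
ORDER BY (id1, id2, id3, id4, id5, id6)
AS SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count() AS cnt
FROM G1_1e9_1e2_0_0
GROUP BY id1, id2, id3, id4, id5, id6;
-- Without such a structure, the benchmark query scans and aggregates at query time.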

jangorecki commented on August 23, 2024

@qoega ok, thanks for the clarification. If you happen to run the script again in the future, could you move q10 to the start of the script, before q1, and let us know what the timing is?

Tmonster commented on August 23, 2024

DuckDB doesn't use an index, and statistics don't give any advantage here. DuckDB aggregates in two phases. In the first phase, each thread has its own fixed-size hash table, which is small enough to mostly stay in the CPU caches; the data is also partitioned during this phase. In the second phase, the thread-local data is combined, with each thread combining one partition. At that point all the data is present, so DuckDB can allocate a hash table big enough to fit everything for the partition. This strategy prevents a lot of random access and avoids any resizing of hash tables during either phase. The design also scales well with the high number of threads on the machine.
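From the SQL side, the closest way to observe this (a sketch, assuming the benchmark data is already loaded into a table named G1_1e9_1e2_0_0) is to profile the query in DuckDB; the two-phase work described above happens inside the single HASH_GROUP_BY operator:

-- Sketch, assuming the data is already loaded into a table named G1_1e9_1e2_0_0.
PRAGMA threads = 8;  -- set the degree of parallelism for this session
EXPLAIN ANALYZE      -- shows the HASH_GROUP_BY operator and its timing
SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS cnt
FROM G1_1e9_1e2_0_0
GROUP BY id1, id2, id3, id4, id5, id6;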

jangorecki commented on August 23, 2024

OK, thank you both for the explanations. Those timings look amazingly impressive!
