Giter Club home page Giter Club logo

Comments (6)

hsyuan avatar hsyuan commented on August 17, 2024

@vraghavan78

from gporca.

dhanashreek89 avatar dhanashreek89 commented on August 17, 2024

DDL

create table sale(sales_id int, product_id int, quantity int);
create table product (product_id int, description text);
create table sales_old (like sale);

insert into sale select i, random()::int%1000, random()::int%100000 from generate_series(1,1000)i;
insert into product select random()::int%1000, 'abc'|| i from generate_series(1,1000)i;
insert into sales_old select i, random()::int%1000, random()::int%100000 from generate_series(1,1000)i;

ORCA performs aggregation after the the join. Sale table is broadcast for NLJ

EXPLAIN SELECT s.product_id,sum(s.quantity) 
FROM sale s, product p, sales_old s1 WHERE p.product_id > s.product_id GROUP BY s.product_id;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice4; segments: 3)  (cost=0.00..3026748101636960.50 rows=2 width=12)
   ->  HashAggregate  (cost=0.00..3026748101636960.50 rows=1 width=12)
         Group By: sale.product_id
         ->  Redistribute Motion 3:3  (slice3; segments: 3)  (cost=0.00..3026748101636960.50 rows=1 width=12)
               Hash Key: sale.product_id
               ->  Result  (cost=0.00..3026748101636960.50 rows=1 width=12)
                     ->  GroupAggregate  (cost=0.00..3026748101636960.50 rows=1 width=12)
                           Group By: sale.product_id
                           ->  Nested Loop  (cost=0.00..3025976926542662.00 rows=96396886787335872 width=8)
                                 Join Filter: true
                                 ->  Nested Loop  (cost=0.00..23625794812.05 rows=96351120006 width=8)
                                       Join Filter: product.product_id > sale.product_id
                                       ->  Sort  (cost=0.00..721.50 rows=333808 width=8)
                                             Sort Key: sale.product_id
                                             ->  Table Scan on sale  (cost=0.00..438.71 rows=333808 width=8)
                                       ->  Materialize  (cost=0.00..540.93 rows=865927 width=4)
                                             ->  Broadcast Motion 3:3  (slice2; segments: 3)  (cost=0.00..537.46 rows=865927 width=4)
                                                   ->  Table Scan on product  (cost=0.00..441.24 rows=288643 width=4)
                                 ->  Materialize  (cost=0.00..458.23 rows=1000475 width=1)
                                       ->  Broadcast Motion 3:3  (slice1; segments: 3)  (cost=0.00..457.23 rows=1000475 width=1)
                                             ->  Table Scan on sales_old  (cost=0.00..438.70 rows=333492 width=1)
 Optimizer status: PQO version 2.54.0
(22 rows)

Looks like this query would be a good candidate for this transform. If I rewrite the query as follows, ORCA will perform GroupBy before Join:

EXPLAIN SELECT s.product_id, SUM(s.quantity)
FROM (select product_id, SUM(quantity) as quantity from sale group by product_id) s, product p, sales_old s1 WHERE p.product_id > s.product_id
GROUP BY s.product_id;
                                                                         QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice5; segments: 3)  (cost=0.00..1356857848.95 rows=2 width=12)
   ->  GroupAggregate  (cost=0.00..1356857848.95 rows=1 width=12)
         Group By: sale.product_id
         ->  Sort  (cost=0.00..1356857848.95 rows=1 width=12)
               Sort Key: sale.product_id
               ->  Redistribute Motion 3:3  (slice4; segments: 3)  (cost=0.00..1356857848.95 rows=1 width=12)
                     Hash Key: sale.product_id
                     ->  Result  (cost=0.00..1356857848.95 rows=1 width=12)
                           ->  HashAggregate  (cost=0.00..1356857848.95 rows=1 width=12)
                                 Group By: sale.product_id
                                 ->  Nested Loop  (cost=0.00..1356857821.24 rows=222223 width=12)
                                       Join Filter: true
                                       ->  Nested Loop  (cost=0.00..1324184.57 rows=223 width=12)
                                             Join Filter: product.product_id > sale.product_id
                                             ->  Broadcast Motion 3:3  (slice3; segments: 3)  (cost=0.00..431.05 rows=2 width=12)
                                                   ->  Result  (cost=0.00..431.05 rows=1 width=12)
                                                         ->  GroupAggregate  (cost=0.00..431.05 rows=1 width=12)
                                                               Group By: sale.product_id
                                                               ->  Sort  (cost=0.00..431.05 rows=1 width=12)
                                                                     Sort Key: sale.product_id
                                                                     ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..431.05 rows=1 width=12)
                                                                           Hash Key: sale.product_id
                                                                           ->  Result  (cost=0.00..431.05 rows=1 width=12)
                                                                                 ->  HashAggregate  (cost=0.00..431.05 rows=1 width=12)
                                                                                       Group By: sale.product_id
                                                                                       ->  Table Scan on sale  (cost=0.00..431.01 rows=334 width=8)
                                             ->  Table Scan on product  (cost=0.00..431.01 rows=334 width=4)
                                       ->  Materialize  (cost=0.00..431.03 rows=1000 width=1)
                                             ->  Broadcast Motion 3:3  (slice1; segments: 3)  (cost=0.00..431.03 rows=1000 width=1)
                                                   ->  Table Scan on sales_old  (cost=0.00..431.01 rows=334 width=1)
 Optimizer status: PQO version 2.54.0
(31 rows)

from gporca.

hsyuan avatar hsyuan commented on August 17, 2024

The group by before join in your example is manually added, not generated by the rule.

from gporca.

dhanashreek89 avatar dhanashreek89 commented on August 17, 2024
create table sale(sales_id int, product_id int, quantity int);
create table product (product_id int primary key, description text);   => add PK constraint
insert into sale select i, i, i  from generate_series(1,1000)i;
insert into product select i, 'abc'|| i from generate_series(1,1000)i;

set optimizer_enumerate_plans=on;
set optimizer_plan_id=2;

EXPLAIN SELECT sum(s.product_id) FROM sale s, product p WHERE p.product_id = s.product_id GROUP BY s.product_id;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=0.00..863.00 rows=334 width=8)
   Hash Cond: product.product_id = sale.product_id
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..431.02 rows=1000 width=4)
         ->  Table Scan on product  (cost=0.00..431.01 rows=334 width=4)
   ->  Hash  (cost=431.40..431.40 rows=334 width=12)
         ->  Gather Motion 3:1  (slice3; segments: 3)  (cost=0.00..431.40 rows=1000 width=12)
               ->  Result  (cost=0.00..431.36 rows=334 width=12)
                     ->  GroupAggregate  (cost=0.00..431.36 rows=334 width=12)
                           Group Key: sale.product_id
                           ->  Sort  (cost=0.00..431.35 rows=334 width=12)
                                 Sort Key: sale.product_id
                                 ->  Result  (cost=0.00..431.16 rows=334 width=12)
                                       ->  Result  (cost=0.00..431.15 rows=1000 width=12)
                                             ->  GroupAggregate  (cost=0.00..431.15 rows=1000 width=12)
                                                   Group Key: sale.product_id
                                                   ->  Sort  (cost=0.00..431.15 rows=1000 width=4)
                                                         Sort Key: sale.product_id
                                                         ->  Broadcast Motion 3:3  (slice2; segments: 3)  (cost=0.00..431.08 rows=1000 width=4)
                                                               ->  Table Scan on sale  (cost=0.00..431.01 rows=334 width=4)
 Optimizer: PQO version 2.55.13
(20 rows)

In the above plan, the groupBy is pushed below the join but the final plan looks inferior because of the Broadcast and two Gather Motions.
Nonetheless this alternative demonstrates CXformPushGbBelowJoin transform.

from gporca.

hsyuan avatar hsyuan commented on August 17, 2024

Hmm, what is missing previously is the primary key or unique key, in fact, so that we only need to aggregate one side of join, which makes sense.
Without unique key, we have to aggregate both side, e.g. the following 2 queries are equivalent:

create table t1(a int, b int);
create table t2(a int, b int);

select t1.b, count(t1.a) from t1 join t2 on t1.b=t2.b group by 1;

select t1.b, cnt*c from 
 (select b, count(a) as cnt from t1 group by 1) t1 join 
 (select b, count(*) as c from t2 group by 1) t2
 on t1.b=t2.b;

But the Orca plan above is so bad that the plan does group aggregate twice!

from gporca.

hsyuan avatar hsyuan commented on August 17, 2024

Thanks for answering my question, @dhanashreek89. Closing this issue.

from gporca.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.