cmu-db / optd
CMU-DB's Cascades optimizer framework
Home Page: https://cmu-db.github.io/optd/
License: MIT License
Currently it uses the heuristic optimizer. It seems that to use the cascades optimizer, we might need to introduce a more advanced dummy cost model. It might also be the case that we want to just keep testing on the heuristic optimizer.
See #140 for context.
List of logical->logical rule files to port from datafusion:
Make sub-issues as needed and link them here.
We can (1) collect all table scans from the plan node, (2) retrieve the schemas in the async context, and (3) remove async from schema inference.
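The two-phase approach above can be sketched as follows. This is a hypothetical illustration, not optd's actual types: `PlanNode` stands in for the real plan representation, and phase 2 (the async schema lookup) is only indicated in a comment so the sketch stays dependency-free.

```rust
// Phase 1: walk the plan tree synchronously and gather every scanned table,
// so that schema retrieval can happen in one async place and schema
// inference itself no longer needs to be async.
use std::collections::BTreeSet;

enum PlanNode {
    TableScan { table: String },
    Filter { child: Box<PlanNode> },
    Join { left: Box<PlanNode>, right: Box<PlanNode> },
}

// Collect all table names referenced by scans (sync, no I/O).
fn collect_table_scans(node: &PlanNode, out: &mut BTreeSet<String>) {
    match node {
        PlanNode::TableScan { table } => {
            out.insert(table.clone());
        }
        PlanNode::Filter { child } => collect_table_scans(child, out),
        PlanNode::Join { left, right } => {
            collect_table_scans(left, out);
            collect_table_scans(right, out);
        }
    }
}

fn main() {
    let plan = PlanNode::Filter {
        child: Box::new(PlanNode::Join {
            left: Box::new(PlanNode::TableScan { table: "t1".into() }),
            right: Box::new(PlanNode::TableScan { table: "t2".into() }),
        }),
    };
    let mut tables = BTreeSet::new();
    collect_table_scans(&plan, &mut tables);
    // Phase 2 (not shown): resolve `catalog.schema(t).await` for each table in
    // the async caller, building a map consumed by the now-sync inference.
    assert_eq!(tables.into_iter().collect::<Vec<_>>(), vec!["t1", "t2"]);
    println!("ok");
}
```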
optd-core
We should display the cardinality and cost of the physical plan in `explain`. We can implement this with the `RelNodeMetaMap` introduced in #65, by storing the cost in `RelNodeMeta`. Since the `explain` output is produced in a separate traversal of the plan tree after optimization, and `Pretty` objects are immutable after construction, we should pass the `RelNodeMeta` as a parameter to `dispatch_explain`.
Currently, we do not compute upper/lower bounds for the pruning process; we simply explore the full logical plan space and compute a winner for the physical plan. The pruning logic is already there, and we just need to provide lower/upper bounds of groups to the system.
The logical properties of the logical optimizer include:
I don't think this task is important, because I don't see physical properties being very useful when evaluating Datafusion. Sometime later, we should add physical property support to the optimizer framework.
Remove:
let batches = df.collect().await?;
from `datafusion-optd-cli/src/exec.rs`, because it will internally run `datafusion`'s logical optimizer. We should instead call the other `collect` function after running our own optimizer and converting the `optd` plan into `datafusion`'s (`impl ExecutionPlan`). This will allow us to run our end-to-end optimized query on `datafusion`.
When an expr generated by a heuristic rule already exists in another group, we merge the two groups and mark the old expr as a dead end (mark all rules as having been fired for it). This might be costly.
Here are several alternatives:
It happens in two scenarios:
With a small number of rules, the first scenario is quite rare. This feature might be improved and tested with more rules in the future.
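The merge-on-duplicate behavior described above can be sketched with a minimal memo. This is a hypothetical illustration: expressions are plain strings here, whereas optd's real memo keys on hashed plan node contents, and `merged_into` plays the role of the "reduced group id" link.

```rust
use std::collections::HashMap;

// Minimal memo: expr -> group map plus merge links from dead groups to the
// surviving group.
struct Memo {
    expr_to_group: HashMap<String, usize>,
    merged_into: HashMap<usize, usize>, // old group id -> surviving group id
    next_group: usize,
}

impl Memo {
    fn new() -> Self {
        Memo { expr_to_group: HashMap::new(), merged_into: HashMap::new(), next_group: 0 }
    }

    // Follow merge links to the surviving group id.
    fn reduce(&self, mut g: usize) -> usize {
        while let Some(&to) = self.merged_into.get(&g) {
            g = to;
        }
        g
    }

    // Insert `expr` into `group`; if the expr already exists in another group,
    // the two groups are logically equivalent, so merge them.
    fn insert(&mut self, expr: &str, group: usize) -> usize {
        let group = self.reduce(group);
        match self.expr_to_group.get(expr) {
            Some(&existing) => {
                let existing = self.reduce(existing);
                if existing != group {
                    self.merged_into.insert(group, existing);
                }
                existing
            }
            None => {
                self.expr_to_group.insert(expr.to_string(), group);
                group
            }
        }
    }

    fn new_group(&mut self) -> usize {
        self.next_group += 1;
        self.next_group - 1
    }
}

fn main() {
    let mut memo = Memo::new();
    let g1 = memo.new_group(); // will hold Filter(true) -> child(g2)
    let g2 = memo.new_group();
    memo.insert("Scan(t1)", g2);
    memo.insert("Filter(true, g2)", g1);
    // A rule rewrites g1's filter to just g2's expr: the duplicate insert
    // triggers a merge, and g1 now reduces to g2.
    let surviving = memo.insert("Scan(t1)", g1);
    assert_eq!(surviving, g2);
    assert_eq!(memo.reduce(g1), g2);
    println!("ok");
}
```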
Resources:
This would be a far-in-the-future thing, but just as a note: for filter pushdown, we would have to take this into account when pushing past a projection.
From Chi:
yes we should do that, feel free to create an issue...
this is so-called "column name inference"
assign a name for unnamed columns, and use user-specified names for projections
and for the top-level node, if we have physical properties implemented, it will always be a gather node
unless all plan nodes are singletons
ColumnRef and Schema seem to have a lot of purpose overlap, I think that we can merge their functionality together.
Without the datafusion logical optimizer enabled, the join predicate is stored in the `filter` field instead of the `on` field. This breaks equality-condition joins because we do not handle that case. We need to reconsider the logic of handling join conditions.
> explain select * from t1 join t2 on t1.t1v1 = t2.t2v1;
Join.on: []
Join.filter: Some(
    BinaryExpr(
        BinaryExpr {
            left: Column(
                Column {
                    relation: Some(
                        Bare {
                            table: "t1",
                        },
                    ),
                    name: "t1v1",
                },
            ),
            op: Eq,
            right: Column(
                Column {
                    relation: Some(
                        Bare {
                            table: "t2",
                        },
                    ),
                    name: "t2v1",
                },
            ),
        },
    ),
)
thread 'main' panicked at optd-datafusion-bridge/src/lib.rs:320:14:
called `Result::unwrap()` on an `Err` value: unsupported join filter: Some(BinaryExpr(BinaryExpr { left: Column(Column { relation: Some(Bare { table: "t1" }), name: "t1v1" }), op: Eq, right: Column(Column { relation: Some(Bare { table: "t2" }), name: "t2v1" }) }))
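The panic above happens because the bridge only looks at the `on` field. A sketch of the missing handling is to split the join `filter` into equi-join column pairs (which belong in `on`) plus a residual predicate. The `Expr` shape below is a simplified stand-in, not datafusion's actual expression type.

```rust
// Simplified expression tree: columns, equality, conjunction, literals.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Literal(bool),
}

// Returns (equi-join column pairs, remaining non-equi conjuncts).
fn split_join_filter(expr: &Expr) -> (Vec<(String, String)>, Vec<Expr>) {
    let mut on = Vec::new();
    let mut residual = Vec::new();
    collect(expr, &mut on, &mut residual);
    (on, residual)
}

fn collect(expr: &Expr, on: &mut Vec<(String, String)>, residual: &mut Vec<Expr>) {
    match expr {
        // Conjunctions are split and each side classified independently.
        Expr::And(l, r) => {
            collect(l, on, residual);
            collect(r, on, residual);
        }
        // Column = Column is an equi-join condition.
        Expr::Eq(l, r) => match (l.as_ref(), r.as_ref()) {
            (Expr::Column(a), Expr::Column(b)) => on.push((a.clone(), b.clone())),
            _ => residual.push(expr.clone()),
        },
        other => residual.push(other.clone()),
    }
}

fn main() {
    // t1.t1v1 = t2.t2v1 AND true
    let filter = Expr::And(
        Box::new(Expr::Eq(
            Box::new(Expr::Column("t1.t1v1".into())),
            Box::new(Expr::Column("t2.t2v1".into())),
        )),
        Box::new(Expr::Literal(true)),
    );
    let (on, residual) = split_join_filter(&filter);
    assert_eq!(on, vec![("t1.t1v1".to_string(), "t2.t2v1".to_string())]);
    assert_eq!(residual, vec![Expr::Literal(true)]);
    println!("ok");
}
```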
EXTRACT(year FROM orders.o_orderdate)
and it simply uses the N-Distinct of orders.o_orderdate as the cardinality of the query.
p_name like '%forest'
and just use o_orderdate as o_year, you get exactly 60150 rows.
Currently, we have so many representations of plan nodes:
- `RelNode`: the only thing we should have in the core framework.
- `OptRelNode`: a set of pre-defined plan nodes that works for most database systems; should be moved to another crate.
- `RelRuleNode`: that's the evil thing. When handling rule conversion, we do not have access to the full plan node; the only thing we have is some part of it and the group ids of the children. This is so-called binding in Cascades/Columbia. The `RelRuleNode` also makes it harder to convert from the dynamically-typed `RelNode` to statically-typed `LogicalXXX` structs: `LogicalXXX::from_rel_node` is only implemented on `RelNode`, not on `RelRuleNode`.
Given that the optimizer only supports Cascades, we can ask the user to implement a `placeholder(group_id)` function on `RelNodeTyp`, so that the optimizer can put placeholders into the user's plan node representation and everything can work as usual. With that, we can eliminate `RelRuleNode` from the system and use a unified representation of plan nodes everywhere.
On the user side, they only need to ensure their plan type enum contains a special placeholder variant for optd internal use.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum OptRelNodeTyp {
    Placeholder(GroupId), // every set of plan nodes should have this
    // Plan nodes
    // Developers: update `is_plan_node` function after adding new elements
    Projection,
    ...
}
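With such a variant, a rule can treat every child uniformly and ask whether it is a real plan node or a group reference. The sketch below is hypothetical (simplified types, made-up helper names) and only illustrates how the placeholder variant could be consumed.

```rust
// Stand-in for optd's group id newtype.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct GroupId(usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum OptRelNodeTyp {
    Placeholder(GroupId), // optd-internal: a reference to a memo group
    Projection,
    Filter,
}

impl OptRelNodeTyp {
    // A rule can ask any child whether it is a group reference.
    fn placeholder_group(&self) -> Option<GroupId> {
        match self {
            OptRelNodeTyp::Placeholder(g) => Some(*g),
            _ => None,
        }
    }

    // Placeholders are never real plan nodes.
    fn is_plan_node(&self) -> bool {
        matches!(self, OptRelNodeTyp::Projection | OptRelNodeTyp::Filter)
    }
}

fn main() {
    let child = OptRelNodeTyp::Placeholder(GroupId(42));
    assert_eq!(child.placeholder_group(), Some(GroupId(42)));
    assert!(!child.is_plan_node());
    assert!(OptRelNodeTyp::Filter.is_plan_node());
    println!("ok");
}
```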
- Error: ArrowError(NotYetImplemented("Unsupported Interval Expression with value \"115 day (3) seconds\""))
- cannot find best binding for group (Not supporting left semi join)
- Error: Arrow error: Invalid argument error: Invalid comparison operation: Decimal128(15, 2) >= Decimal128(20, 0)
- Join's filter documentation says "Filters applied during join (non-equi conditions)", but there can actually be equal comparisons!
- unsupported plan node: CreateView: Bare { table: "revenue0" }
- cannot find best binding for group (Not supporting left anti join)
- cannot find best binding for group (Not supporting left semi join)
- cannot find best binding for group (Not supporting left semi join)
- cannot find best binding for group (Not supporting left semi and left anti join)
- cannot find best binding for group (Not supporting left anti join)
- OuterReferenceColumn (handled by moving the column to the outer query). Since we do not know how to access columns of the outer query, I think we should enable DF's logical optimizer when dealing with subqueries.
- Error during planning: table 'datafusion.public.xxxx' not found. When using explain_with_logical in the planner tests, make sure to execute the DDLs with execute_with_logical!
- InListExpr: Q12, Q16, Q19 (#77)
- IntervalMonthDayNano: Q1, Q10, Q14 (#78)
- ColumnRefPropertyBulder
- LogOp: Q16

For both testing the optd cascades core and shrinking the search space, we were trying to add heuristic rules in cascades. However, some simple rules like eliminateFilter, which eliminates a filter node when the filter predicate is always true, lead to a situation where we have to add a MergeGroup operation in optd core.
// memo before eliminateFilter
Group(n): Filter(true) -> child(Group(n+1))
Group(n+1): expr
// memo after eliminateFilter
Group(n): Filter(true) -> child(Group(n+1)), placeholder(Group(n+1))
Group(n+1): expr
// The result returned by eliminateFilter means that Group(n) and Group(n+1) are actually logically equivalent. So we add MergeGroup in optd to merge these two groups.
There is no MergeGroup definition in any Cascades or Columbia paper, nor a corresponding implementation. Adding MergeGroup is pretty experimental. Besides, real-world optimizers always have a rewrite/prepare phase before the cascades framework that normalizes expressions, so there is actually no need to apply such heuristic rules inside cascades.
The introduction of MergeGroup leads to problems, such as loops in the operator tree.
Take `Select * from t1 where true` as an example. The initial groups are:
// memo
G1: filter(true) -> child(G2)
G2: Scan (t1)
After eliminateFilter in cascades rule, it becomes:
// memo
G1: filter(true)->child(G2), Scan(t1)
G2: merged to G1, reduceGroupId=G1
This is OK when we execute the query, since execution only cares about the winner. But when we want to traverse the operator tree, we run into loops for expressions like filter(true)->child(G2), because such expressions now point to themselves as a child.
This causes tricky bugs in logic other than execution, for example join order enumeration, tree traversal, and debug printing.
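One way to make traversals survive such self-references is to track the groups on the current path and stop when a group (after following merge links) repeats. The sketch below uses simplified, hypothetical memo shapes, not optd's actual structures.

```rust
use std::collections::{HashMap, HashSet};

// Count expressions reachable from `group`, tolerating merge-induced cycles
// by refusing to revisit a group already on the current path.
fn count_reachable_exprs(
    exprs: &HashMap<usize, Vec<(String, Vec<usize>)>>, // group -> [(expr, child groups)]
    reduce: &HashMap<usize, usize>,                    // merged group -> surviving group
    group: usize,
    visiting: &mut HashSet<usize>,
) -> usize {
    let group = *reduce.get(&group).unwrap_or(&group);
    if !visiting.insert(group) {
        return 0; // group already on this path: a merge-induced cycle, stop
    }
    let mut n = 0;
    for (_expr, children) in exprs.get(&group).into_iter().flatten() {
        n += 1;
        for &c in children {
            n += count_reachable_exprs(exprs, reduce, c, visiting);
        }
    }
    visiting.remove(&group);
    n
}

fn main() {
    // G1: filter(true)->child(G2), Scan(t1); G2 merged into G1.
    let mut exprs = HashMap::new();
    exprs.insert(1, vec![
        ("filter(true)".to_string(), vec![2]),
        ("Scan(t1)".to_string(), vec![]),
    ]);
    let mut reduce = HashMap::new();
    reduce.insert(2, 1); // G2 reduces to G1: filter(true) now points at itself
    let mut visiting = HashSet::new();
    // Without the visited check, this recursion would never terminate.
    assert_eq!(count_reachable_exprs(&exprs, &reduce, 1, &mut visiting), 2);
    println!("ok");
}
```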
In optd, the order in which tasks are fired is carefully designed following the Columbia paper.
We first fire an OptimizeGroup task for the root/top/first group. It traverses all the expressions in the group and fires an OptimizeExpr task for each logical expr and an OptimizeInput task for each physical expr.
OptimizeExpr first calls ExploreGroup tasks for the expr's children and then traverses the rules, firing an ApplyRule task for each rule that matches the expr. Newly generated logical exprs are fired with new OptimizeExpr tasks, since they were not in the group when OptimizeGroup or ExploreGroup was called.
OptimizeInput is responsible for calculating the costs of the expr and updating the winner for the group; it fires OptimizeGroup for a child group when that group does not yet have a winner. Except for the root/top/first group, OptimizeGroup is only ever called on a group that is a child of a physical expression being processed by OptimizeInput.
In summary, the cascades framework expects all groups and their children to be fully explored (all matching rules applied) from the top group down to the leaf groups; costs and winners are then updated from the top group's physical expressions down to the children groups.
If we merge two groups, both groups existed before the merge (rules cannot create new groups themselves; they can only return existing groups). Each of them has either been called with ExploreGroup/OptimizeGroup or not. If a group was explored before, we cannot explore it again (ExploreGroup skips groups based on the group id, and even though groups are merged, the group id remains the same). And OptimizeGroup can only be called for the root group or for child groups of physical expressions in a parent group.
So if the merge destination group was explored before while the merge source group was not, and neither group gets called with OptimizeGroup (because it is either not a child group of any physical expression, or pruned if we add pruning), we will miss the newly added expressions from the merge source group. 🤯🤯🤯 It is a very subtle case and hard to find.
MergeGroup leads to debugging nightmares and tricky problems; we may need to remove it. Besides, if we don't add heuristic rules into the cascades logic, we really don't need it.
In the cascades framework, physical properties have two facets:
Derived properties are the physical properties derived from children. For example, scanning a table whose column 1 is ordered ascending gives the scan node the derived property "column 1 ordered ascending".
Required properties are the physical properties required by the query. For example, select * from t1 order by t1.v1 gives the root node the required property that v1 must be ordered.
When the derived properties cannot satisfy all required properties, an operator created by an enforcer rule is inserted into the operator tree.
Where to insert the operator created by the enforcer rule depends on the calculated costs (e.g., determining whether select v1, v2 from t1 order by v1 becomes Sort->Projection->TableScan or Projection->Sort->TableScan; more complicated cases involve joins, filters, subqueries, ...).
Derived properties are per-expression: multiple expressions in one optd group can have different derived physical properties. (For example, a sort-merge join provides the column in order while a hash join does not, yet they are logically equivalent.)
Required properties are per-group: for a query like select * from t1 order by v1, the root node's group should have the required property (ordering = v1).
A critical issue is that we have to calculate costs to determine where to put the enforcer. For example, a node can either choose children that already have the required properties among their derived properties, in which case the cost is the child expression's cost, or choose children without them and insert an enforcer on top, in which case the cost becomes the enforcer operator's cost plus the child's cost.
However, since children are represented as group-id placeholders in optd, we cannot directly pick the best child (winner) with or without the required physical properties. And since derived properties are per-expression, we might need to traverse all the expressions in the child group.
This change of logic hugely affects the OptimizeGroup task (we need an interface for required physical properties), the OptimizeInput task (we need to calculate different winners for different physical property requirements), and the main memo structure (we need to store physical properties per expr/exprId and derive physical properties when adding a new expr to a group).
CockroachDB's doc (https://github.com/cockroachdb/cockroach/blob/fe36467047fbbf88569fc03cff97fb4d728122f3/pkg/sql/opt/xform/optimizer.go#L319) provides a good example of how to implement physical properties and enforcer rules. It launches two optimizeGroup tasks, one with the required properties and one without; the task without the required properties must add the cost of the enforcer operator, and the costs of the two results are then compared. For the children, it separates the child group by properties, so these plans are different: (select G2="ordering: y" G3) and (select G2 G3).
Columbia's paper defines the winner structure with physical properties. It states:
In Columbia, a winner is also used to store the temporary result of a search. While the costs of physical multi-expressions of a group are being calculated, the cheapest yet found expression is stored as a winner. During the optimization process, the winner is improved and finally the best (cheapest) plan is found. Sometimes, when no physical multi-expression can be found with the required physical property, we store the multi-expression pointer as NULL to indicate no winner for that physical property. Since no winner is also a solution for a search of this sub-problem, this information is memoized and will be useful in the future optimizing process. The following is the definition of data members in a WINNER class in Columbia.
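The Columbia-style winner described above can be sketched as a per-(group, required property) memo entry where an explicit "no winner" is also recorded, so a failed search is never repeated. Types and names below are illustrative, not optd's.

```rust
use std::collections::HashMap;

// A required physical property, e.g. a sort order on some column.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum RequiredProp {
    Any,
    Ordered(String), // e.g. Ordered("v1")
}

#[derive(Debug, Clone, PartialEq)]
struct Winner {
    expr: Option<usize>, // None: proven that no expr satisfies the property
    cost: f64,
}

struct Group {
    winners: HashMap<RequiredProp, Winner>,
}

impl Group {
    // Keep the cheapest winner seen so far for each required property.
    fn update_winner(&mut self, prop: RequiredProp, expr: Option<usize>, cost: f64) {
        let keep_existing = matches!(self.winners.get(&prop), Some(w) if w.cost <= cost);
        if !keep_existing {
            self.winners.insert(prop, Winner { expr, cost });
        }
    }
}

fn main() {
    let mut g = Group { winners: HashMap::new() };
    g.update_winner(RequiredProp::Any, Some(7), 100.0);
    g.update_winner(RequiredProp::Any, Some(9), 80.0); // cheaper: replaces
    g.update_winner(RequiredProp::Any, Some(3), 90.0); // costlier: ignored
    // No expr produces output ordered by v1: memoize the failure too.
    g.update_winner(RequiredProp::Ordered("v1".into()), None, f64::INFINITY);
    assert_eq!(g.winners[&RequiredProp::Any].expr, Some(9));
    assert!(g.winners[&RequiredProp::Ordered("v1".into())].expr.is_none());
    println!("ok");
}
```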
Calcite creates subgroups (RelSubset) within groups: a subgroup contains the expressions that are logically equivalent and have the same physical properties. It then optimizes each subgroup and finds a winner within it. (https://github.com/apache/calcite/blob/4823cb7760913f236e7f0f2cb149325b55a3f124/core/src/main/java/org/apache/calcite/plan/volcano/RelSubset.java#L84)
Generally, my idea is to
We can run regression test on the plan we generate for TPC-H SQLs. There is already a tool for such planner tests: https://github.com/risinglightdb/sqlplannertest-rs
[optd-datafusion-bridge](https://github.com/cmu-db/optd/tree/main/optd-datafusion-bridge)
There is a crate in the `datafusion` tree, `datafusion-substrait`, that claims to convert between `datafusion` and `substrait` representations. However, on the physical side (which is what we need to convert), it seems to have the same problem we would encounter if we sent `impl ExecutionPlan` trait objects around: it doesn't seem to know the type of, or any information about, each node. So it seems we wouldn't want to use it in this case, and should instead convert directly from `optd` nodes. I may be wrong on this---check for yourself here.

Cost usually contains row count, network cost, I/O cost, compute cost, etc. A reasonable abstraction for it in optd is simply a vector of values. The cost model implementor can interpret the vector however they want; inside the optimizer, costs are compared simply by the first element of the vector (the weighted cost).
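A minimal sketch of that cost abstraction, assuming made-up names (`Cost`, `weighted`) rather than optd's actual API:

```rust
// An opaque vector of cost values. Only the first element (the weighted
// cost) drives comparisons; the rest (row count, I/O, compute, ...) is
// interpreted however the cost model implementor wants.
#[derive(Debug, Clone, PartialEq)]
struct Cost(Vec<f64>);

impl Cost {
    fn weighted(&self) -> f64 {
        self.0[0]
    }
}

impl PartialOrd for Cost {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        // The optimizer only ever looks at the weighted cost.
        self.weighted().partial_cmp(&other.weighted())
    }
}

fn main() {
    // [weighted, row_count, io, compute] -- interpretation is model-defined.
    let a = Cost(vec![10.0, 1000.0, 2.0, 8.0]);
    let b = Cost(vec![12.0, 500.0, 9.0, 3.0]);
    assert!(a < b); // lower weighted cost wins, even with a larger row count
    println!("ok");
}
```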
When there are multiple join predicates, they are stored as a LogOp(And):
optd/optd-datafusion-bridge/src/into_optd.rs
Lines 299 to 305 in ad37149
But this case is not handled in the projection pull up rule.
optd/optd-datafusion-repr/src/rules/joins.rs
Lines 330 to 342 in 755db92
Here, it seems we assume `cond` is a `BinOpExpr` (so that its children are always `Expr`s), but for `LogOp` its child is an `ExprList`, whose type is `List`, which is not an expression. That's why the `unwrap()` on line 335 panics.
We can hard-code this scenario, or categorize `OptRelNodeTyp::List` as an expression (but I'm not 100% sure whether `List` is only used to store expressions).
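A sketch of the hard-coded fix: collect the conjuncts of a join condition without assuming the top-level node is a binary op, handling both a binary `And` chain and a variadic `LogOp(And)` whose child plays the `ExprList` role. Node shapes here are simplified stand-ins, not the real `OptRelNode` types.

```rust
// Simplified condition tree with both binary and variadic conjunctions.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    BinOp { op: String, left: Box<Node>, right: Box<Node> },
    LogOp { op: String, children: Vec<Node> }, // child list = the ExprList role
    Column(String),
}

// Collect the individual conjuncts of a join condition. The buggy code
// effectively handled only the BinOp case.
fn conjuncts(cond: &Node) -> Vec<&Node> {
    match cond {
        Node::LogOp { op, children } if op == "And" => {
            children.iter().flat_map(conjuncts).collect()
        }
        Node::BinOp { op, left, right } if op == "And" => {
            let mut v = conjuncts(left);
            v.extend(conjuncts(right));
            v
        }
        other => vec![other],
    }
}

fn main() {
    let eq1 = Node::BinOp {
        op: "Eq".into(),
        left: Box::new(Node::Column("a.x".into())),
        right: Box::new(Node::Column("b.x".into())),
    };
    let eq2 = Node::BinOp {
        op: "Eq".into(),
        left: Box::new(Node::Column("a.y".into())),
        right: Box::new(Node::Column("b.y".into())),
    };
    // Two predicates stored as LogOp(And), as in the issue.
    let cond = Node::LogOp { op: "And".into(), children: vec![eq1.clone(), eq2.clone()] };
    let cs = conjuncts(&cond);
    assert_eq!(cs.len(), 2);
    assert_eq!(cs[0], &eq1);
    println!("ok");
}
```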
We already have an intermediate representation of rule matchers, so it would be easy to implement a Rust macro that converts a rule DSL into the IR. One proposal:
(join inner !a !b !cond) -- defines a matcher on join a and b, for the join commute rule
(join inner (join inner !a !b !condab) !c !condabc) -- defines a matcher on two joins, for the join assoc rule
`!a`, `!b`, `!cond` will be picked up, and the rule can use them to compose the new plan node.
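The two DSL examples above could expand into a matcher IR along these lines. The `Matcher` enum and the hand-written expansions below are hypothetical (the macro itself is elided), meant only to show what the IR for each s-expression might look like.

```rust
// A matcher IR: either match a specific node type with sub-matchers for each
// child/field, or capture whatever sits at this position (the `!name` form).
#[derive(Debug, PartialEq)]
enum Matcher {
    Node { typ: &'static str, children: Vec<Matcher> },
    Capture(&'static str),
}

// Hand expansion of: (join inner !a !b !cond)
fn join_commute_matcher() -> Matcher {
    Matcher::Node {
        typ: "join.inner",
        children: vec![
            Matcher::Capture("a"),
            Matcher::Capture("b"),
            Matcher::Capture("cond"),
        ],
    }
}

// Hand expansion of: (join inner (join inner !a !b !condab) !c !condabc)
fn join_assoc_matcher() -> Matcher {
    Matcher::Node {
        typ: "join.inner",
        children: vec![
            Matcher::Node {
                typ: "join.inner",
                children: vec![
                    Matcher::Capture("a"),
                    Matcher::Capture("b"),
                    Matcher::Capture("condab"),
                ],
            },
            Matcher::Capture("c"),
            Matcher::Capture("condabc"),
        ],
    }
}

fn main() {
    let m = join_assoc_matcher();
    if let Matcher::Node { typ, children } = &m {
        assert_eq!(*typ, "join.inner");
        assert!(matches!(children[0], Matcher::Node { .. })); // nested join
        assert_eq!(children[1], Matcher::Capture("c"));
    } else {
        unreachable!();
    }
    let _ = join_commute_matcher();
    println!("ok");
}
```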
The correct plan node set generally follows Calcite's definition of plan nodes; we might need to adjust them a little to make them work with Datafusion.