SET PLANNER TO COST
or similar, and default to a more "static" one.
If you're looking to be cross platform then hopefully you've already got a layer to translate intent to an appropriate SQL dialect, so you can drop it in there.
For a cute little select statement, it might be true. However it's just not manageable most of the time.
Note that it's an introduction to building database systems (not just using them).
As an example, consider a memo group that contains both a MergeJoin and a HashJoin operator. The MergeJoin "requires" its inputs to be sorted on the same key. If an input is not already sorted, then we insert a "sort enforcer" operator that will provide the required ordering. Of course, that increases the cost of producing the input, which the optimizer must take into account.
So, getting back to your question, for any given memo group, we always select the equivalent plan with the lowest cost, for a given set of physical properties. However, that has the effect of sometimes selecting a memo group operator that has a higher cost, because it "pays off" higher up the operator tree. Back to my example, say the SQL query is something like this:
The lowest cost plan might use a MergeJoin, because its output is already naturally sorted on "name". However, the input to the MergeJoin might be higher-cost than logical equivalents, because it must be sorted. Therefore, we have the situation you alluded to, where we choose a higher-cost subtree because it minimizes the cost of the larger tree that contains it.
SELECT * FROM a, b WHERE a.name=b.name ORDER BY a.name
Im pretty sure cost-based optimization is common enough and practically proven effective enough that its not really fair to call it a “leap of faith”; eg im pretty sure mysql and postgres both do it (and you can it see it in their EXPLAINs), and I imagine most commercial dbs do as well
Its more of just a natural progression as the db gets improved
Its a good article/work nonetheless, but nothing out of the ordinary in db implementation afaik
Im not at all versed in the subject, but it seems like the memo datastructure isn’t significant to the correctness of the outcome: it makes the search space smaller/faster, but doesn’t seem to lead to better cost assignment, and the logical equivalence rules should be enforced regardless of the memo dt (the alternative being external to the memoization structure)
Is it just that faster search => more paths searched before timing out/giving up? And perhaps easier to reason about/encode?
You're correct on your points (search space can be smaller/faster, with less memory usage). In addition, when you're dealing with a piece of software as complex as a SQL optimizer, choosing modular, elegant abstractions can make a big difference in terms of understandability, maintainability and extensibility of the codebase. That in turn leads to fewer bugs, such as subtle violations of logical equivalence when transforming the tree. I think the more structured, general approach to optimization allows mere mortals to manage higher levels of complexity, in which the RDBMS defines 100's of transformations, offers many join ops (hash, merge, nested, zigzag, etc.) and data types (geospatial, json, etc), as well as incorporating data locality in the new world of geo-distribution.
As one example to illustrate why there's a practical correspondence between well-designed abstractions/architecture and correctness (i.e. fewer bugs), consider these PRs that we recently merged:
These PRs randomly disable transformation rules and perturb the cost of expressions in the optimizer, in order to test alternate (but equivalent) plans. It took just 100 lines of code to add this support, because it simply plugs into the general optimization framework. And in return, our existing test cases can now test multiple different plans, allowing us to shine light into the dark corners where bugs like to hide (cockroaches, of course!).
As another example, we have a testing mode called "optsteps", which will single-step the optimizer and print out the tree after each transformation to make it easy to track down bugs. Here's an example of that output, again made possible by the extensible, modular design of the optimizer:
+1 on the sentiment, good to see them continuing to improve, they have lots of competition to look at for examples of what to do / not to do. Progress is good none the less.
Any chance you could share more about this? This general area of how to build a production grade SQL optimizer seems to be a thing that's more scattered in tiny pieces across a wide number of papers, or held in peoples' heads, than aggregated in a teaching manner. It seemed that the realistic general advice on how to build a SQL optimizer was to poach someone from the SQL Server team. ;)
I've generally just gone referring back to the unfinished Building Query Compilers when pondering this subject. Not that the hundreds of references don't provide sufficient material to read though as well, but it'd be interesting to hear what a bootcamp like this presented as state of the art.
Genetic query optimiser: https://www.postgresql.org/docs/current/geqo.html
postgres has a number of hooks, which can be used to override the default behaviour. In particular, there are a bunch of hooks that can be used to install a custom query planner. https://github.com/AmatanHead/psql-hooks/blob/master/Readme....
more generally, ignoring database query optimisation specifically, if you are interested in learning about discrete optimisation techniques, I recommend this course: https://www.coursera.org/learn/discrete-optimization
Thanks for the link. Wasn't aware of this podcast before.
I don't know if Oracle still uses CBO now, or even SQL or PL/SQL but I am sure that a large chunk of their revenue still comes from supporting these legacy systems.
Might be worth a read if you're interested in implementation techniques.
Has any thought been given to whether this query planner could be adapted (much further down the road, I'd guess) to support dynamic replanning? (That is, altering the plan mid-query, if it should be found that the row-count estimates were way off.)
Another commenter posted an interesting paper that's related to your question:
All table statistics are stored in a system table (system.table_statistics). CREATE STATISTICS scans the table, calculating statistics on a column, and adds a new row to system.table_statistics.
We still have a ways to go in making statistics automatic (right now it requires a manual call to CREATE STATISTICS) and more useful to the optimizer (which currently only uses basic stats). We're working on that right now for the 2.2 release.
The best plan for this simple query depends on the selectivity of the predicates. For example, if 20k rows have a.y='foo', but only 10 rows have b.z='bar', then it's best to scan a b.z index, then lookup matching rows in a. But if the #rows is reversed, then it's better to scan an a.y index and lookup in b. This is a simplified example, but we do see queries along these lines in real OLTP workloads.
You're correct that for many OLTP workload queries, simple heuristics are sufficient. However, even if that's true for 90% of queries, it's the last 10% that gets you. We've seen customers with 10 queries in their workload, where 9 work well but the last 1 gets a bad plan that is 10x slower than it could/should be. Maybe they can rewrite it, or maybe they don't have sufficient knowledge to do so. Or perhaps they're using an ORM and don't have control over the queries it's sending to the DB. In addition, many mostly-OLTP workloads contain a few OLAP reporting queries in the mix. Developers don't expect their OLTP database to perform like a dedicated OLAP data warehouse DB, but they also expect it not to fall over when it gets a more complex query.
They have an entire section in their documentation for Python and SQLAlchemy: https://www.cockroachlabs.com/docs/stable/build-a-python-app...