This post was motivated by discussions at ICDE a few weeks ago, but between travel and moving and my normal levels of procrastination, it’s taken way too long to write up. It’s less polished than an optimistic-me imagined back in May, but it’s time to get my basic thoughts written down before I completely forget.

Applying AI to databases (DBs) is a field of research that looks at how to replace or improve DB components using models (or now, agents). Most of the DB community seems to agree that these approaches are unlikely to gain widespread adoption in production systems because of the strict performance guarantees these systems must meet. However, I think there is promise in reframing AI for DB to instead use agents in an AlphaEvolve-style setup to improve the rules, heuristics, and statistics that DB systems already rely on. In this framing, no new nondeterminism is introduced into query processing; it is just a search for better algorithms for hard problems where we already accept non-optimal solutions.

Databases from 10,000 feet

AlphaEvolve

AI for DB using AE

Databases from 10,000 feet

For the non-DB folks out there, this section gives a brief overview of query processing in relational database systems and some of the hard problems that systems must navigate to achieve good performance.

At a high level, when you submit a SQL query, the system does the following (a toy sketch of these stages appears after the list):

  1. Parses the query
  2. Translates it into a relational algebra expression
  3. Generates an optimized physical execution plan
  4. Executes the physical query plan and returns the result
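To make these stages concrete, here is a toy walk-through for a single query. Every structure and name below is invented for illustration and does not correspond to any real system's internals.

```python
# Toy walk-through of the four stages for one query; all structures here are
# made up for illustration, not any real system's internals.

sql = "SELECT name FROM users WHERE age > 30"

# 1. Parse: SQL text -> abstract syntax tree (shown as nested dicts/tuples).
ast = {"select": ["name"], "from": "users", "where": ("age", ">", 30)}

# 2. Translate: AST -> relational algebra, roughly project_name(filter_{age>30}(users)).
logical_plan = ("project", ["name"], ("filter", ("age", ">", 30), ("scan", "users")))

# 3. Optimize: pick a physical plan, e.g. use an index instead of a full table scan.
physical_plan = ("project", ["name"],
                 ("index_scan", "users", "idx_users_age", ("age", ">", 30)))

# 4. Execute: run the chosen plan over stored tuples (hand-inlined here).
users = [{"name": "ada", "age": 36}, {"name": "bob", "age": 25}]
print([{"name": row["name"]} for row in users if row["age"] > 30])  # [{'name': 'ada'}]
```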

There are a lot of hard problems in the optimization and execution phases. For example, in the optimization phase, the system must decide things like:

  • If there are multiple tables, in which order should they be joined?
  • If there are multiple tables, how should they be joined? (Nested loop or hash join?)
  • If there are indexes and views available, which ones should be used to find and read tuples?

This isn’t an exhaustive list; it just illustrates some places where rules and heuristics are used.

The output of the optimization phase is a physical execution plan that defines the order of operators (scan, filter, join, etc.) in the query, and what algorithm is used for each operator; it maps to the code that the system executes to retrieve result tuples. Most systems generate a physical execution plan using a form of cost-based optimization where they estimate the cost of different physical plans and pick the cheapest one. Cost estimation relies on cardinality estimation, which is the problem of estimating the size of intermediate results (i.e., the number of tuples after applying a filter or joining two tables). Cardinality estimation is yet another place where database systems use statistics and heuristics (which often lead to wildly inaccurate estimates).
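As a minimal sketch of why these estimates go wrong, here is the textbook uniformity/independence heuristic applied to a two-predicate filter. The statistics and numbers are invented for illustration and don't reflect any particular system's cost model.

```python
# Minimal sketch of textbook cardinality estimation for a conjunctive filter;
# the statistics and numbers are illustrative only.

# Per-column statistics a system might keep: number of distinct values.
stats = {
    "orders.status": 5,            # 5 distinct status values
    "orders.customer_id": 10_000,  # 10,000 distinct customers
}
orders_rows = 1_000_000

# Selectivity of an equality predicate under the classic uniformity assumption.
def eq_selectivity(column: str) -> float:
    return 1.0 / stats[column]

# Independence assumption: selectivities of conjunctive predicates multiply.
sel = eq_selectivity("orders.status") * eq_selectivity("orders.customer_id")
estimated_rows = orders_rows * sel
print(f"estimated rows after both filters: {estimated_rows:.1f}")  # 20.0

# If the columns are actually correlated (e.g., a customer's orders all share
# one status), the true count could be ~100 rows: a 5x error on a toy example,
# and errors compound quickly across the joins of a real query.
```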

Picking a bad physical plan can be devastating and cause a query to take orders of magnitude longer to execute than a good plan. Therefore, query optimizers are more concerned with pruning bad plans than with finding the optimal query plan.

Example work in AI for DB that focuses on query optimization/execution includes:

Despite promising results in research papers, there has been little adoption of model-based components in databases because production systems cannot tolerate even a small probability of a bad prediction producing a catastrophic query plan. Also [steps onto tiny soap box], the pressure to publish incentivizes evaluations that look beautiful but don’t translate to enterprise settings [steps off tiny soap box].

AlphaEvolve

AlphaEvolve uses an evolutionary approach to iteratively improve algorithms based on automated evaluation metrics.

The user provides as input:

  • An initial implementation of some program. The user can flag which sections they want modified.
  • Automatic evaluation code. The program can be evaluated along multiple metrics (in fact, the paper found that even when there is one primary metric of interest, using multiple metrics can lead to better solutions).

The AlphaEvolve system iteratively evolves the program to optimize for the evaluation metrics (see the sketch after this list):

  • The prompt sampler pulls historical program implementations to improve on (starting with the user-provided implementation, and then augmenting it with model-generated implementations as it runs).
  • An ensemble of LLMs proposes new implementations via code diffs on top of previous solutions.
  • Evaluators score each candidate program using the user-provided logic.
  • The program database keeps track of promising solutions that can be included in prompts of future iterations.
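Here is a minimal, runnable sketch of that loop's shape. The LLM ensemble is stubbed out with a random mutation and the toy "program" is just a list of numbers, so none of this reflects AlphaEvolve's actual implementation; it only shows how the sampler, proposer, evaluators, and program database fit together.

```python
import random

def evaluate(program: list[float]) -> float:
    """User-provided evaluation logic; here: how close the numbers sum to 10."""
    return -abs(sum(program) - 10.0)

def propose(parent: list[float]) -> list[float]:
    """Stand-in for the LLM ensemble proposing a diff on top of a parent."""
    child = parent.copy()
    child[random.randrange(len(child))] += random.uniform(-1.0, 1.0)
    return child

# Program database: starts with the user-provided initial implementation.
database = [[1.0, 1.0, 1.0]]

for _ in range(200):
    # Prompt sampler: pull a promising historical solution to improve on.
    parent = max(random.sample(database, min(3, len(database))), key=evaluate)
    # An LLM (here, the stub) proposes a new candidate on top of it.
    child = propose(parent)
    # Evaluators score the candidate; keep promising ones for future iterations.
    if evaluate(child) >= evaluate(parent):
        database.append(child)

best = max(database, key=evaluate)
print(best, evaluate(best))
```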

Others have achieved good results programmatically generating kernels using a similar (but yet-unnamed) approach.

AI for DB using AE

In which everything comes together into an acronym soup.

While replacing DB components with models seems to have limited utility outside of research papers, I think that databases serve as a rich pool of problems amenable to the techniques from AlphaEvolve. Agents could evolve and improve on the existing rules/heuristics/statistics used by DB systems rather than replace components outright.

In database-land, take cardinality estimation as a target task. We could use agents or models to perform cardinality estimation directly: the input is samples, statistics, and maybe an embedding of the query plan; the output is an estimate. With the AlphaEvolve approach, by contrast, the agents modify the code that the system uses to perform cardinality estimation; the output is a code diff. No agents are involved in actual query processing; they (hopefully) just improve the query processing logic.
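As a hypothetical sketch of what that could look like (the marker syntax, function, and heuristic below are mine for illustration, not taken from any system or from the paper), the evolvable unit could be an ordinary function whose marked region the agent is allowed to rewrite, while query processing keeps calling plain code:

```python
# Hypothetical example of code an agent could evolve; the marker syntax and
# the heuristic are illustrative only.

def estimate_join_cardinality(left_rows: int, right_rows: int,
                              left_distinct: int, right_distinct: int) -> float:
    # EVOLVE-BLOCK-START
    # Textbook equi-join heuristic: |L join R| ~= |L| * |R| / max(V(L.k), V(R.k)).
    # The agent proposes diffs to this block to reduce estimation error on the
    # evaluation workload; the rest of the system is untouched.
    return left_rows * right_rows / max(left_distinct, right_distinct, 1)
    # EVOLVE-BLOCK-END

print(estimate_join_cardinality(1_000_000, 50_000, 10_000, 10_000))  # 5000000.0
```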

Some open questions:

  • Would any of this work at all?
  • How well would agents evolve large implementations (e.g., in query optimizers with hundreds of rules)? The AlphaEvolve white paper says that it “evolves up to hundreds of lines of code”.
  • What evaluation metrics are important? Database systems have many metrics that can be automatically measured such as cardinality estimation error, query runtime, and query optimization latency.

Database problems seem to be an ideal setting for a technique like AlphaEvolve since metrics can be automatically evaluated given test queries and databases. Generating this test set may not be trivial, but I suspect most systems already have test suites that guard against regressions from code changes in these components; those test suites could be reused to compute evaluation metrics. In fact, this is another place where the AlphaEvolve approach is more feasible than replacing components with models: training a model (e.g., for cardinality estimation) requires a far larger set of training queries/databases than the test set needed to validate a code change.
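For example, a regression-test suite with known true cardinalities could be turned into an automatic evaluator almost directly. Here is a minimal sketch using q-error (max(est/true, true/est), a standard metric in the cardinality estimation literature); the test data is made up.

```python
# Sketch of an automatic evaluator built from an existing test suite: compare
# each test query's estimate against its known true cardinality and report
# q-error. The numbers below are fabricated for illustration.

test_cases = [
    # (true cardinality, estimated cardinality) for each test query
    (1_200, 900),
    (45, 60),
    (500_000, 20_000),
]

def q_error(true: float, est: float) -> float:
    true, est = max(true, 1.0), max(est, 1.0)
    return max(true / est, est / true)

errors = sorted(q_error(t, e) for t, e in test_cases)
print(f"median q-error: {errors[len(errors) // 2]:.2f}")
print(f"max q-error:    {errors[-1]:.2f}")
```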