What Is a SQL Validator?

The validator passed the query. Syntax clean. Tables exist. Columns exist. Joins are well-formed. The CI check is green. The developer ships.

An hour later, production reporting is showing every customer's lifetime value as the value of their largest single order. The query parsed. The query ran. The query was wrong.

I have lived this in explain-analyze-first debugging where the planner shows you exactly which index was used, which join order was chosen, and which sort spilled to disk — and tells you absolutely nothing about whether the result is the result the developer was trying to compute. The plan is correct for the query. The query is correct for what it says. What it says is not what the developer meant.

SQL validators have the same shape. They check the layer that is checkable cheaply — syntax, references, types — and report green on the queries that will most spectacularly fail in production. The validator is not lying. The validator is doing the job it was scoped for. The job that catches the catastrophes lives one layer down.

Step One — The Wrong Assumption

"The query passed validation. We can ship it."

"The validator is green. The CI check passed. The query is good to deploy."

The first instinct treats validation as a binary. The validator runs, passes or fails, and a passing run is permission to ship. The premise is that "valid" means "correct," and the validator's green light is a sufficient quality gate.

The premise is wrong because SQL validators check the layer they can check — the layer of grammar and reference resolution — and that layer is the easy half of correctness. The hard half is semantic: does this query compute what the analyst meant to compute, against the data as it actually exists, with the joins resolving the way the analyst expected, and the aggregations rolling up at the granularity the consumer assumed. None of those questions are answerable by parsing the SQL. All of them are answerable by running it against representative data and checking the result, which the validator does not do.

Step Two — The Partial Signal

Three of four validation layers run cleanly. The fourth is whether the result is right.

SQL validators do real work. Syntax checking catches typos before they hit a production parser. Reference resolution catches dropped tables, renamed columns, and dialect-specific function calls that will fail at runtime. Type checking catches comparisons between incompatible types. Linting catches stylistic issues, anti-patterns, and known performance hazards like SELECT-star in production code. Each is a real category of failure, and catching them in CI is unambiguously better than catching them in production.

What none of them catch is whether the query computes the intended result. A LEFT JOIN where the developer meant INNER produces a different result with no syntactic difference. An aggregation grouped at the wrong level produces a different number with no syntactic difference. A subquery that filters before the join versus after the join produces a different result with no syntactic difference. Each of these is a semantic choice, each looks identical to the validator, and each is a category of bug that ships under green CI.

This is the partial signal. The validator's coverage of the syntactic layers is high. The semantic layer is invisible to the validator by design, because semantic correctness requires data and the validator does not have the data.

Step Three — The Failed Fix

You add a runtime check on a sample dataset. The sample is too small to surface the bug.

The team's response is reasonable. Add a runtime check. Run the query against a sample dataset in CI, snapshot the result, compare against an expected value. The sample is built for speed — ten thousand rows, a representative slice of production. The runtime check passes. The query ships.

The bug surfaces in production because the bug is data-dependent. The query produces the right answer when the underlying tables have the cardinalities and distributions of the sample, and the wrong answer when the production data has the cardinalities and distributions of production. A subtle COUNT(DISTINCT) edge case that does not fire on the sample fires on the full dataset. A grouping that produces a single row in the sample produces ten thousand rows in production. The runtime check was correct on the data it had. The data it had was not the data that mattered.

The fix did not fix anything because it added a layer of validation against unrepresentative data. The team is now in a worse position than before, because the green CI now includes a fake semantic check, and developers reasonably trust the green light more than they did when it was only syntactic.

Step Four — The Real Failure

It was never a parser gap. It was that semantic correctness requires data, intent, and a contract — and validators have only the SQL.

The actual structure of SQL correctness has three layers. The syntactic layer — what the validator checks — is parseability and reference resolution. The data layer — what runtime checks attempt to cover — is whether the query produces the right answer against representative data. The intent layer — what nothing automated catches — is whether the answer the query produces is the answer the analyst was trying to produce, expressed in a form the consumer of the result expects.

The intent layer cannot be checked from the SQL alone. It can only be checked against an explicit contract: a description of what the query is supposed to compute, the granularity of the result, the semantics of the joins, the expected cardinality, the boundary cases. With that contract, semantic checks become possible — the query's expected output can be specified, the runtime can verify it, the validator can fail when the expectation is not met. Without the contract, the runtime check is checking against whatever the query happens to produce, which is a tautology and not a test.

This is the lesson DBAs and query authors have been re-learning since the introduction of automated query tooling. The tool catches the layer it can catch. The catastrophic failures live one layer down, where data, intent, and SQL meet, and where a tool with only the SQL cannot reach.

Step Five — The Definition

Now the definition lands.

A SQL validator is a tool that checks SQL queries for syntactic correctness, reference resolution, type compatibility, and adherence to style rules — the layer of correctness that is decidable from the SQL text alone. Semantic correctness, which depends on data and intent, lives at a deeper layer that validators do not reach. The validator is necessary; it is not sufficient.

Most definitions describe a SQL validator as a tool that checks queries for errors. The description is accurate and silent on which errors. The errors validators catch are the ones decidable from the SQL alone. The errors that bring down production are usually the ones that depend on data and intent, which the SQL alone does not encode.

Programs that rely on the validator as the quality gate ship the catastrophic failures with green CI. Programs that treat the validator as the precondition for the semantic check ship a smaller set of catastrophic failures.

What Solix Enforces

Validation at the boundary; semantic contracts bound to the data.

What Solix's data quality and governance capabilities enforce in this category is the contract layer that semantic validation depends on. Queries that consume data from the governed environment inherit the schema, the cardinality expectations, and the referential constraints from the records themselves — not from a separately-maintained spec that drifts independently of the data. The validator becomes a check against the contract, not a check against syntax alone.

For SAP ECC, Oracle E-Business Suite, and the long tail of operational systems whose data feeds analytics queries, the same model applies. The contract travels with the records. The semantic checks become operational. The green CI light becomes meaningful, because it is checking against an expectation that exists, not against a tautology.

Three things to do this week

  • Audit your SQL CI for the difference between syntactic and semantic checks. Walk through your last ten merged SQL changes. Identify which CI checks were syntactic (parser, linter, reference resolution) and which were semantic (output verified against an expected result). The ratio is the diagnostic. Programs with 100% syntactic checks are running validation theatre.
  • Pick one critical query and write its semantic contract before the next deploy. Specify, in plain language: what the query is supposed to compute, the granularity of the result, the expected row count range, the semantics of each join, and the boundary cases. The contract is then version-controlled alongside the SQL. The semantic check becomes a diff against the contract, not a guess against unrepresentative data.
  • Stop trusting validators as the quality gate; treat them as a precondition. The honest framing is that the validator is one of three checks: syntactic (validator), data-correctness (runtime against representative data with explicit expectations), and intent (contract-bound semantic check). Programs that ship on the first alone are building dashboards of green CI that hide the failures that matter.

References