The best part I think is "pg_query is special. It doesn’t actually implement parsing SQL. It works by extracting C source code directly from Postgres and wraps it with a nice Rust interface. This allows PgDog to understand all queries that Postgres can."
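To make that concrete, here's a minimal sketch (my illustration, not PgDog's code) of what using the `pg_query` crate looks like from Rust:

```rust
// Minimal sketch of the pg_query crate (Rust bindings to libpg_query):
// the string below is parsed by the actual Postgres parser, compiled
// into the library.
fn main() -> Result<(), pg_query::Error> {
    let result = pg_query::parse("SELECT id, email FROM users WHERE active")?;
    // Convenience helper: the relations referenced by the statement.
    println!("tables: {:?}", result.tables());
    // The full Postgres AST is also available as a protobuf message.
    println!("statements: {}", result.protobuf.stmts.len());
    Ok(())
}
```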
One of the things that puts metaprogramming on wheels is being able to pull in grammars, schemas and such for external systems. Think how many cool tools we could have if we could do this in a more general way.
> One of the things that puts metaprogramming on wheels is being able to pull in grammars, schemas and such for external systems.
That's what I've been trying to do with: https://github.com/schemamap/schemamap
For a well-constrained Postgres schema, a deterministic SQL compiler can be built (with plenty of logic programming, constraint solving, and hypergraph analysis) that can integrate data from arbitrary external systems.
While this is NP-hard in general, with some clever use of embeddings and ontologies, plus using every single constraint AST within the DB to reduce the search space, it becomes a feasible problem.
For any Clojurists interested, I've packaged `pg_query`, so you can use it in your applications: https://github.com/schemamap/pg-query-clj
At the moment I'm saving up for the next development cycle, so that more than just PG->PG schema mappings can be solved for (JSON Schema is next!). Hope this sounds interesting :)
I had such high hopes for tree-sitter but once it went all "and then, $CC -c -o" all was lost :-(
What do you mean?
Supabase's Postgres LSP works in a similar way, IIRC.
Yes, the same way. It's all based on the extremely useful `https://github.com/pganalyze/libpg_query` project, which is where the "extracted the parser from Postgres" part comes in.
Supabase's LSP also uses tree-sitter for corrections and autocomplete, because one drawback of using the server's source is that pg_query only works on well-formed, executable SQL: when it detects a malformed query, it reports an error and stops, since that's what you want in a SQL server. For partially-correct syntax and fill-forward completion, tree-sitter covers the gaps.
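A quick sketch of that failure mode with the Rust crate: malformed input yields an `Err`, not a partial tree.

```rust
fn main() {
    // Well-formed SQL parses; anything the server would reject errors out.
    match pg_query::parse("SELEC id FROM users") {
        Ok(result) => println!("parsed {} statement(s)", result.protobuf.stmts.len()),
        // Typically something like: syntax error at or near "SELEC"
        Err(e) => println!("parse error: {e}"),
    }
}
```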
I agree. Does anyone know how much heavy lifting is done by pg_query in wrapping the Postgres code vs. Postgres in expressing that code in a manner that makes pg_query possible?
Tends to be a matter of opinion. Postgres does not expose the relevant functions, so https://github.com/pganalyze/libpg_query has to do some heavy lifting to convert the source code into a nice library. Conversely, Postgres is very well-written code, in an extremely common language, with a stable release cadence, and such a long track record that it is seen as the reference implementation for correctly parsing SQL.
Yeah, as one of the main authors of libpg_query, I think the primary things that make this easier are that Postgres has good abstractions internally, and that the parser works independently of other parts (e.g. the community discourages adding settings that affect parser behavior).
Over the years we've only had to maintain a small set of patches on top of the Postgres source [0], together with some mocks and our libclang-based extraction logic [1]. Of course it would be nice if Postgres just packaged this directly like it packages the client libraries, but there is non-trivial effort involved to do that. From what I recall, the main issue is that error handling and memory allocations work differently in the client-side libraries (and so that would have to either also be moved out of the backend source, or use some other abstraction).
[0]: https://github.com/pganalyze/libpg_query/tree/17-latest/patc...
[1]: https://github.com/pganalyze/libpg_query/blob/17-latest/scri...
>we need to ... understand SQL syntax using a parser, Rust ecosystem has a great library called pg_query
I'm impressed both that the Rust ecosystem had the right library available and that it was high enough quality code for the author's team to use it as a reference for building their own additional functionality.
pg_query was created by the pganalyze team for their own purposes, I believe initially for features like index recommendation tooling, but it was planned as open source from the start. It is indeed a very high quality project, and the underlying C implementation has wrappers for a number of languages [1].
[1] https://github.com/pganalyze/libpg_query/blob/15-latest/READ...
Oddly, it actually originates from the Ruby ecosystem: pganalyze initially created `libpg_query` to parse Postgres queries in Ruby. `libpg_query` in C does the "magic" of reformulating Postgres's query parser into an AST generator and serializes the resulting AST into a Protobuf. Then the `pg_query` crate bridges that into Rust.
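Since the Rust types are prost-generated, the AST really is just a protobuf message on the Rust side. A small sketch (assuming a `prost` dependency for the `Message` trait):

```rust
use prost::Message; // pg_query's protobuf types are prost-generated

fn main() -> Result<(), pg_query::Error> {
    let result = pg_query::parse("SELECT 1")?;
    // The AST that crossed the C/Rust boundary can be re-encoded and
    // shipped to anything that speaks the pg_query .proto schema.
    let bytes = result.protobuf.encode_to_vec();
    println!("AST is {} bytes of protobuf", bytes.len());
    Ok(())
}
```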
We're using it to rewrite queries too. It's a pretty cool library.
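A hypothetical sketch of that kind of rewrite (not PgDog's code, and assuming the crate's deparse support): mutate the protobuf AST, then turn it back into SQL.

```rust
use pg_query::NodeEnum;

// Rename the table in a simple single-table SELECT, then deparse back
// to SQL. Hypothetical helper; assumes ParseResult::deparse().
fn rename_table(sql: &str, new_name: &str) -> Result<String, pg_query::Error> {
    let mut result = pg_query::parse(sql)?;
    if let Some(raw) = result.protobuf.stmts.first_mut() {
        if let Some(NodeEnum::SelectStmt(select)) =
            raw.stmt.as_mut().and_then(|s| s.node.as_mut())
        {
            if let Some(NodeEnum::RangeVar(rv)) =
                select.from_clause.first_mut().and_then(|n| n.node.as_mut())
            {
                // Point the FROM clause at a different relation.
                rv.relname = new_name.to_string();
            }
        }
    }
    result.deparse()
}
```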
Wrapping up the actual Postgres code is a brilliant idea. Then you never have to worry about updating your code to keep up with their changing API.
And the fact that this works at all is another example of why Postgres is the default data store on the internet. It's battle tested and has a massive ecosystem of support tools, like this one.
> with 4 bytes added for itself
Why would one want to do that? The only reason I can think of is so that values < 4 can be reserved for future uses, like new versions of the client-server protocol that agree on a 64-bit payload if the first 4 bytes are zero, or something like that. But there are better ways.
I'm guessing it's so the length is never zero and can't be confused with something else, like a run of NULs. There are a few messages that don't have a payload, e.g. ParseComplete.
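A sketch of why that's convenient when reading the stream (my illustration, not PgDog's code): the reader always consumes a fixed 5-byte header, and the length field is never zero.

```rust
use std::io::{self, Read};

// Read one message off the Postgres wire protocol: a 1-byte type tag,
// a 4-byte big-endian length that counts itself, then the payload.
fn read_message<R: Read>(conn: &mut R) -> io::Result<(u8, Vec<u8>)> {
    let mut tag = [0u8; 1];
    conn.read_exact(&mut tag)?;
    let mut len_buf = [0u8; 4];
    conn.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    // The length includes its own 4 bytes, so even a payload-less
    // message like ParseComplete carries len == 4, never 0.
    let mut payload = vec![0u8; len - 4];
    conn.read_exact(&mut payload)?;
    Ok((tag[0], payload))
}
```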
Makes sense, thanks.
I am long on this project and excited about it.
I am wondering why it took so long for something like this to happen for Postgres, whereas Vitess has existed for MySQL for more than a decade. I am not an expert, but I hear Citus is not really a fair comparison with Vitess.
PgDog represents maybe just the VTGate portion of Vitess. There's a whole lot more to Vitess, or to any replication-on-top-of-a-database type of service, than just the query router.
> PgDog represents maybe just the VTGate portion of Vitess
That’s today. The project is developing fast, so I am sure more things will be added :)
Exactly :)
> More complex examples, like IN (1, 2, 3) or id != 25 can be handled as well. For the former, we can hash all values and route the query to the matching shards. For the latter, we can do the opposite.
This makes it sound like the query would be routed to all shards except the one matching HASH(25). But wouldn't id != 25 need to go to all shards, since there could be values in the id set that hash to the same shard as 25?
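A hypothetical sketch of the routing rule (ordinary Rust hashing standing in for Postgres's hash functions): each IN-list value maps to a shard, while a != predicate can't exclude any shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

// Hypothetical hash-based shard assignment, not PgDog's actual code.
fn shard_for(id: i64, num_shards: u64) -> u64 {
    let mut h = DefaultHasher::new();
    id.hash(&mut h);
    h.finish() % num_shards
}

// IN (1, 2, 3): hash each value and hit only the matching shards.
fn shards_for_in_list(ids: &[i64], num_shards: u64) -> BTreeSet<u64> {
    ids.iter().map(|id| shard_for(*id, num_shards)).collect()
}

// id != 25: other ids may hash to the same shard as 25, so the query
// must fan out to every shard.
fn shards_for_not_equal(num_shards: u64) -> BTreeSet<u64> {
    (0..num_shards).collect()
}
```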
You're right. Good catch, I'll fix this in the article.
Has someone done this for MySQL?
https://vitess.io/
It would be nice if they went into more detail about how they decide whether a query is a read or a write. The casual mention makes it sound like they originally took a too-naive approach (looking at the first word), but even ignoring CTEs, you can still have an UPDATE within a SELECT subquery, which makes the problem very hard without parsing the whole query. Perhaps the C code extracted from Postgres handles this for them and returns some metadata that indicates whether there are any writes buried in the query?
Forgot to add that to the article, but yes, we parse the statement and route SelectStmt [1] to replicas. If there is an UPDATE inside the SELECT, we don't handle that at the moment, but that can easily be added.
[1] https://docs.rs/pg_query/6.0.0/pg_query/protobuf/struct.Sele...
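A minimal sketch of that kind of check (not PgDog's actual router): walk the parsed statements and only call it a read if every top-level node is a SelectStmt.

```rust
use pg_query::NodeEnum;

// Route to a replica only if every top-level statement is a plain SELECT.
// Sketch only: a production router would also inspect CTEs, locking
// clauses (FOR UPDATE), and functions with side effects.
fn is_read_only(sql: &str) -> bool {
    match pg_query::parse(sql) {
        Ok(result) => result.protobuf.stmts.iter().all(|raw| {
            matches!(
                raw.stmt.as_ref().and_then(|s| s.node.as_ref()),
                Some(NodeEnum::SelectStmt(_))
            )
        }),
        // If it doesn't parse, play it safe and send it to the primary.
        Err(_) => false,
    }
}
```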
Do you have to traverse the syntax tree to determine that?
Yup.