Relational Algebra


#1

Some years ago I got deep into working with ActiveRecord & Arel because there was supposed to be a hot new way to compose more complex queries in the then-upcoming Rails 3.0… this was before it was officially declared off-limits as an internal interface. I eventually had to give up trying to bend it to my will.

But, I was still curious about what Arel was originally promising in its name, before any mention of “relational algebra” got purged from the README (the name is officially just “Arel Really Exasperates Logicians” now).

I eventually figured out that the Arel library itself couldn’t approach a true relational algebra, and even a patch to untangle something basic like double-negatives — NOT(NOT(...)) — was really just a spot-fix.

Regardless, I’m still interested in the idea of having a functional interface for working with relations as composable units.

Yes, it’s possible to do some restricted kinds of query composition in Rails but it’s often easier to drop directly into SQL for anything more complex: the tradeoff is that once you reduce to a SQL string, it becomes opaque to your program and is no longer reusable for combination with other relations. Also, the fact that SQL injection still leads the OWASP Top 10 should be telling us to prefer secured interfaces for building queries; dropping to raw SQL in limited cases as a last resort. (Of course there are ways of using SQL securely with ActiveRecord, but you have to always remain aware of them.)

On the other hand, the Sequel gem and similar syntax builders are excellent for writing queries securely and functionally, but they still expect us to write with the resulting SQL in mind. It still can’t compose two abstract domain concepts together, say like intersecting two scopes between associated models. The algorithm for performing the intersection has to be defined as a concern of one domain object knowing about the data semantics of other objects, not as a separate, composable concept between them.

(SQL calls itself a declarative language, but you actually do need to think about the procedural ways of fetching data, whether from subqueries or joins. Common Table Expressions can help a lot for defining some abstractions though.)

I came across a couple of research projects with higher goals for composable data relations, as more pure implementations of a relational algebra:

These projects attempt to move beyond some of the limitations of providing direct mappings to SQL from the host language, and exposing relations as first-class objects. This is a layer above the query language: Alf actually uses Sequel under the hood; Axiom has its own Arel-like AST visitor.

Examples like this one from Alf look compelling:

  # Get the cities where at least one supplier is located, provided
  # at least one part is located there too.
  cities_from_suppliers = project(suppliers, [:city])
  cities_from_parts     = project(parts, [:city])
  intersect(cities_from_suppliers, cities_from_parts)

The author of Alf summarizes the situation pretty clearly:

  • SQL has not been designed with composition and separation of concerns in mind,
  • Avoiding strong coupling between subqueries tends to be very difficult in practice,
  • Coupling hurts separation of concerns and software design.

…and an ORM or query builder doesn’t solve this.

I’m sure it would need more investigation and lots of work before implementations like this could become practical (let alone used as an underpinning for anything like ActiveRecord), but the ideas still look fascinating to me. I’m curious if anyone has worked/played with similar ideas or has opinions about the goal of actually implementing a relational algebra for data models completely above “SQL thinking”.


#2

I wrote axiom a while back while learning Relational Algebra. Originally I wrote it to underpin DataMapper (which I maintained at the time), and also because I wasn’t happy with the direction ARel was going — it was saying it’s a RA library, but was nothing more than an AR-specific SQL builder.

Nothing wrong with being an ORM specific SQL builder. I had one in DM also.

It’s been a while since i worked on axiom, but it was pretty well tested. At the time it was fully mutation covered, but I know mutant has added new operators since which will probably expose some untested corners (it’s still got 100% code coverage though, so it’s better than the average probably)

The only reason I stopped development on it was more burn-out than anything. DM took more than full-time hours to maintain, and when I worked primarily on Axiom I realized how much fun dev could be again, and how burned out I was, so I gradually tapered off my OSS work

Aside from generating SQL, one nice thing about RA is that it lends itself to some really nice optimizations. See: https://github.com/dkubb/axiom-optimizer

Obviously, I expect most modern databases like PostgreSQL, to perform these kinds of optimizations themselves, but I thought it would be good to be able to have a simplified RA AST in case there were other backend targets besides RDBMS. It also helped me to debug the SQL generator because I could construct really complex queries, optimize them down to a simple form, then output the SQL and inspect/run the result

An interesting side note is that my work on Axiom helped me find a real bug in SQLite. Which if you know anything about how that project is tested, is a real accomplishment: http://sqlite.1065341.n5.nabble.com/Outer-query-returning-results-not-found-in-subquery-td13156.html

I found this bug using a combination of mutation testing (to make sure my code did what it was supposed to do), fuzz testing. I’d generate random queries, execute the queries in the database and in-memory, then compare the results. I ran 10’s of millions of iterations, and found lots of bugs in my SQL generator which were fixed, and a few in different database engines.

I also would throw in the optimizer too for both branches — so in effect, I’d generate one random query, execute it in the database with and without optimization, and execute it in memory with an without optimizations too (so 4 separate executions). The idea is that all 4 variants should always agree on the result in order to be correct — I kept iterating until I got to that point.

I highly encourage anyone interested to learn more about Relational Algebra, especially if you ever write SQL. It really can take your SQL skills to the next level. You can learn a lot from reading axiom and alf, but I’d recommend the book “SQL and Relational Theory” if you really want a deep understanding of the underlying theory, along with more practical application in everyday SQL usage.

Dan Kubb
@dkubb