Why the Relational Knowledge Graph System (RKGS)?
Learn the benefits of using RelationalAI’s Relational Knowledge Graph System.
This guide introduces you to RelationalAI’s Relational Knowledge Graph System (RKGS).
Overview
RelationalAI’s RKGS is a new type of database management system that combines the relational paradigm with the world of knowledge graphs. The RKGS is designed in a cloud-native way and follows a data-centric approach: Data and application logic are managed together within the database.
With Rel, RelationalAI’s declarative modeling language, users can express complex modeling ideas naturally in terms of relations and first-order logic while taking full advantage of the capabilities the RKGS provides.
So what does this all mean? The relational paradigm means that data are managed relationally, similar to a traditional SQL RDBMS. Like these systems, the RKGS uses a query optimizer to automate complex tasks such as algorithm selection. The RKGS pushes this strategy further by utilizing a fully normalized data approach called Graph Normal Form, or GNF for short. In the RKGS, all information, such as a product’s color or an employee-employer relationship, is expressed by a relation.
This approach lets the RKGS support the flexibility and extensibility required for modern knowledge graph applications: Because relations are fully normalized, your database grows organically, eliminating the need to create new schemas or move data for new business problems.
While the RKGS excels at knowledge graph queries, it also lets you represent graph data any way you want: as tables, data frames, key-value pairs, and just about anything else you need to connect different domains across your organization.
Also, because the RKGS stores logic alongside data, you don’t need to replicate logic for each application. Intermediate results — think views in a traditional RDBMS — are stored in the database, such that changes to data automatically update results that depend on this data. Storing logic in the database ensures that the amount of data that need to be touched or created to hold intermediate results is kept to a minimum.
The RKGS also lets you apply domain knowledge to data as they come in; you can apply integrity constraints that ensure that new data conform to application logic. While many relational systems let you apply such constraints to individual tables (relations), the RKGS manages integrity constraints at the database level and therefore lets you naturally express constraints that involve multiple relations.
The RKGS is a cloud-native system. At a basic level, this means that you don’t need to manage software or hardware on premises. However, the RKGS also takes advantage of flexible cloud storage and the separation of data from computation afforded by the cloud.
As noted, by storing data in GNF, the RKGS reduces data duplication within your database. Building on this strategy, the RKGS, like other cloud-native systems, uses immutable data to reduce data duplication across databases. Immutable data enable capabilities such as “zero-copy cloning” and “time travel”.
Because the RKGS is cloud native, data storage and query computation are decoupled, so storage and computation can be scaled independently of each other. This approach is ideal for data-heavy and computationally heavy applications. It is also possible to assign dedicated engines to processes or users, enabling multiple concurrent workloads without competition for computing resources.
The following sections explore the points above in more detail.
Data Centricity: Data and Logic in the Database
The RKGS lets you include logic in your database that generates new data. For example, you might have data that state that Jing is the father of Jane, and Jane is the mother of Tran. You could then declare logic stating that “all parents of parents are grandparents,” meaning that Jing is the grandfather of Tran. The latter is an example of derived data. As you make changes elsewhere in your application, the RKGS incrementally updates these derivations.
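The grandparent example can be sketched in a few lines of Python. This is only an illustration of derived data, not Rel syntax and not how the RKGS evaluates rules: the names `father`, `mother`, `parent`, and `grandparent` are chosen here for the example, and the sketch recomputes the derivation from scratch rather than updating it incrementally.

```python
# Base relations as sets of (parent, child) tuples, using the names from the text.
father = {("Jing", "Jane")}
mother = {("Jane", "Tran")}


def parent(father, mother):
    """A parent is a father or a mother."""
    return father | mother


def grandparent(father, mother):
    """Derived relation: all parents of parents are grandparents."""
    p = parent(father, mother)
    return {(g, c) for (g, mid) in p for (mid2, c) in p if mid == mid2}


print(grandparent(father, mother))  # {('Jing', 'Tran')}
```

Nothing about Jing and Tran is stored directly; the pair exists only because the logic derives it from the two base facts.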
Fully Normalized Data
Because of its join algorithms, the RKGS lets you work efficiently with fully normalized relational data persisted in object storage. Instead of using wide tables with fixed schemas, you can store all data in narrow tables, which you can combine according to application needs.
Fully normalized data means that data are not duplicated across tables, and that you don’t end up with tables with lots of null or repeated values. You can reuse data across your organization that might otherwise be buried in wide tables. If you know, for example, that you need to reference a state or city’s population, there’s no need to add a population column to a table with, for example, information on sales for a state or city. Instead, you can create a fully normalized “table” to look up population each time you need it. Many RDBMSs support similar use cases; the difference with the RKGS is that you can use such tables with almost no limits.
This means that you can apply multiple logical data models to the same data set. In graph database terms, you can define multiple graphs over the same data set, meaning that each graph can have both its own queries and its own users.
RelationalAI calls this fully normalized form of data storage Graph Normal Form, or GNF for short. GNF is RelationalAI’s implementation of Sixth Normal Form. In Sixth Normal Form, each relation has one or more key columns and just one value column. There are no nulls and no empty rows.
Fully normalized data allow you to have flexible “schemas” that evolve over time. For example, adding or removing “columns” from your data does not require you to create a new table and copy all data over from the old table. In other words, having data fully normalized means that data changes only affect the data elements that need to be changed.
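As a rough illustration of the idea, the Python sketch below splits a wide table into narrow relations with one key column and one value column each. The product data, the column names, and the helper `to_narrow_relations` are all hypothetical, and the sketch says nothing about how the RKGS stores GNF relations internally.

```python
# Hypothetical wide "product" rows; None marks a missing value.
wide_rows = [
    {"id": 1, "name": "chair", "color": "red", "weight_kg": 4.5},
    {"id": 2, "name": "desk", "color": None, "weight_kg": 20.0},
]


def to_narrow_relations(rows, key):
    """Split wide rows into one relation per attribute: {attr: {(key, value), ...}}."""
    relations = {}
    for row in rows:
        for attr, value in row.items():
            if attr == key or value is None:
                continue  # a missing value is simply absent; no null is stored
            relations.setdefault(attr, set()).add((row[key], value))
    return relations


relations = to_narrow_relations(wide_rows, "id")

# Adding a new "column" means adding a new relation; existing ones are untouched.
relations["price_usd"] = {(1, 49.0)}
```

The desk has no color, so the `color` relation simply has no tuple for it, and adding `price_usd` does not require rewriting `name`, `color`, or `weight_kg`.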
Fully normalized data eliminate the need to reorganize data to suit each workload. Because the schema is not organized around one specific workload, it adapts more readily to unanticipated ones: you can answer new questions without needing to copy and reorganize data into new tables. By putting data in fully normalized form, you can avoid problems such as duplication and silos, ensuring that all data maintain integrity and are available across your organization.
Storing data in fully normalized form provides maximum composability. That is, you can easily select and combine “tables” — or, more precisely, relations — for a variety of use cases. One of the benefits of the RKGS is that you don’t need to “create” tables at all; you create and interrelate relations which serve any number of use cases. Some of these may look like traditional SQL tables, but others may, for example, specify application logic.
Fully normalized data also give the query optimizer all the freedom it needs to find the best query plan, ensuring efficient scaling for graph workloads as the size of the graph grows.
Immutable Data
The RKGS stores data immutably in cloud storage. Storing data immutably eliminates the need for locks and latches: instead of modifying existing data in place, the RKGS writes every change as new data, so the underlying data are never modified when you make changes. The RKGS creates new databases as “branches” of old ones.
As a result, in the RKGS, all databases can be cloned. You can make copies of your database in order to create a branch, preserve a historical copy for compliance or recovery purposes, and so on. It’s crucial to note, though, that such clones are not actually full copies of a database; instead, clones copy metadata that refer to your data, which themselves do not change. The result is a second “copy” that does not require additional storage. This capability is called zero-copy cloning.
In the diagram below, the two databases on the right are completely separate from one another; you can modify both the original and cloned database without affecting each other.
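The cloning behavior described above can be mimicked with a toy Python class. This is a sketch under simplifying assumptions, not the RKGS storage layer: the `Database` class and its methods are hypothetical, and each relation is held as a single immutable `frozenset`, whereas a real system would share structure at a much finer grain.

```python
class Database:
    """Toy database: metadata maps relation names to immutable data blocks."""

    def __init__(self, blocks=None):
        self._blocks = dict(blocks or {})  # metadata only: name -> frozenset

    def clone(self):
        # Copies the name -> block mapping, never the blocks themselves.
        return Database(self._blocks)

    def insert(self, name, row):
        # Immutable write: build a new block and repoint this database's metadata.
        self._blocks[name] = self._blocks.get(name, frozenset()) | {row}

    def relation(self, name):
        return self._blocks.get(name, frozenset())


prod = Database()
prod.insert("employee", ("Jane", "Acme"))

dev = prod.clone()
# Right after cloning, both databases point at the very same block object.
shared = dev.relation("employee") is prod.relation("employee")

# A write to the clone repoints only the clone's metadata.
dev.insert("employee", ("Tran", "Acme"))
```

After the last line, `dev` sees two employees while `prod` still sees one, and the block that `prod` refers to was never modified.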
Immutable storage also allows you to “time travel” between different versions of your database. If you need to know what your database looked like a week ago, or a year ago, it’s possible to return to an earlier version, again without needing to use computing resources to maintain earlier versions of your database. In other words, you can recreate an earlier version of your graph without losing data.
This allows you, for example, to:
- Create databases for research and development that can be modified without affecting production data.
- Protect production workloads from changes to their data by development workloads running on a clone.
- Protect production workloads that need stable data by having these workloads operate on a clone that does not receive streaming updates.
- Create archived versions of your database for recovery or compliance.
- Access earlier versions of a database. You can “time travel” to older versions of a database that you can use, for example, for archiving and troubleshooting purposes.
- Share data for collaboration both with other RKGS users and with systems outside of the RKGS.
Efficient Graph Scaling
Graph databases excel at many-level queries, which would require lots of expensive joins in an RDBMS. However, because graph databases tend to use a navigational model, they have to store relationships as data. Moreover, many graph systems have difficulty isolating workloads. These limitations cause scaling problems once datasets become large.
In contrast, relational models scale well, but working with fully normalized data requires lots of joins. Typical relational databases tend to have difficulty processing many joins, particularly those needed for graph queries. The RKGS combines the flexibility of a graph database with the scalability, data integrity, and ease of use of a relational database, incorporating relational query optimization into a system that can handle large numbers of joins.
Relational Query Optimization
Like most RDBMSs, the RKGS uses a query optimizer to determine the most efficient way to run your query. The optimizer does this by analyzing the underlying algebraic structure of the query.
Query optimization is the main reason you can build whole applications in a declarative manner: You tell the system what you need to know, and it figures out how to execute the query.
You don’t need to write algorithms for your applications to handle data efficiently. The query optimizer automatically determines a query plan (an algorithm) that most efficiently finds the answer to the query. As part of this optimization, the query optimizer uses semantic optimization, where queries are rewritten behind the scenes into semantically equivalent logic, to arrive at the best query plan. In particular, data dependency analysis of the application logic, and possibly of the integrity constraints, is a crucial step for identifying and exploiting functional dependencies as the query is rewritten. These rewrites can lead to significant speed-ups, especially as the application logic grows in complexity, and at the same time they ensure worst-case optimality.
In the diagram below, the optimizer determines that querying the minimum of functions f and g separately and adding these together is more efficient than adding the functions first and then querying the minimum.
The optimizer constantly makes such determinations, automatically finding the most efficient way to return results.
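A small Python sketch shows why a rewrite of this kind pays off. It assumes that f and g depend on different variables, which is the condition under which the two plans are equivalent; the lookup tables below are hypothetical and are not taken from the diagram.

```python
from itertools import product

# Hypothetical functions given as lookup tables over independent variables x and y.
f = {x: (x - 3) ** 2 for x in range(10)}
g = {y: abs(y - 7) + 1 for y in range(10)}

# Naive plan: add first, then take the minimum over all 10 * 10 (x, y) pairs.
naive = min(f[x] + g[y] for x, y in product(f, g))

# Rewritten plan: take each minimum separately (10 + 10 values), then add once.
rewritten = min(f.values()) + min(g.values())

assert naive == rewritten
```

The naive plan does work proportional to the product of the two domain sizes, while the rewritten plan does work proportional to their sum, and both return the same answer.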
An Improved Relational Paradigm
While the RKGS, like an RDBMS, makes use of a query optimizer, the RKGS improves on the RDBMS model in its capacity to handle a large number of joins. In a graph database, you navigate from node to node for each possible path. That wouldn’t scale in a traditional RDBMS because every edge is a binary join between two nodes, so you have lots of very expensive join plans with exploding intermediate results. When you join three or more tables in a typical RDBMS, it first joins two of them into a temporary table, often larger than the two original tables, and then joins that temporary table to the third table.
Repeat this operation with three, four, or five tables, and you get a join that’s very expensive in terms of both memory and computation.
However, with the RKGS, the system joins all tables at once using what are known as dovetail joins, an implementation of worst-case optimal joins.
As a result, the RKGS excels at workloads that are characterized by lots of joins. In fact, it’s often the case that the more joins you have, the faster the RKGS computes query results. That also means that the RKGS efficiently handles complex computations such as recursive joins.
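The difference between the two join strategies can be illustrated with a triangle query over three binary relations R(a, b), S(b, c), and T(c, a). The Python sketch below is not the dovetail join and makes no claim about the RKGS implementation; it only contrasts a pairwise plan, which materializes an intermediate table, with a simple variable-at-a-time plan that considers all three relations together. The function names and the data are hypothetical.

```python
def index(pairs):
    """Index a binary relation as {first: set(second)}."""
    idx = {}
    for a, b in pairs:
        idx.setdefault(a, set()).add(b)
    return idx


def triangles_pairwise(R, S, T):
    """Binary plan: materialize R join S, then filter against T."""
    s_idx = index(S)
    intermediate = [(a, b, c) for a, b in R for c in s_idx.get(b, ())]
    t_set = set(T)
    result = {(a, b, c) for a, b, c in intermediate if (c, a) in t_set}
    return result, len(intermediate)


def triangles_multiway(R, S, T):
    """Variable-at-a-time plan: bind a, then b, then c by intersecting candidates."""
    r_idx = index(R)                        # a -> {b}
    s_idx = index(S)                        # b -> {c}
    t_idx = index((a, c) for c, a in T)     # a -> {c}
    out = set()
    for a in r_idx.keys() & t_idx.keys():
        for b in r_idx[a].intersection(s_idx):
            for c in s_idx[b] & t_idx[a]:
                out.add((a, b, c))
    return out


# Hypothetical data: 50 edges into node 0, 50 edges out of it, one closing edge.
R = {(i, 0) for i in range(50)}
S = {(0, j) for j in range(50)}
T = {(1, 1)}

pairwise_result, intermediate_size = triangles_pairwise(R, S, T)
multiway_result = triangles_multiway(R, S, T)
```

On this data the pairwise plan builds 2,500 intermediate tuples to find a single triangle, while the multiway plan uses T to narrow the candidates before any combination is enumerated.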
The RKGS evaluates queries incrementally: As you add, change, or remove data from your database, the results of your queries update, and the computational cost of the update scales only with the size of the update and not with the size of your dataset. This paradigm makes intermediate result materialization much less expensive. So if you’re used to working with materialized views, the RKGS maintains the benefits of such views with minimal memory and compute requirements. This is called Incremental View Maintenance, or IVM.
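The idea behind incremental maintenance can be sketched with a toy aggregate view in Python. The `SalesByState` class and its data are hypothetical, and the sketch covers only a simple sum; it assumes that every deleted row was previously inserted, and it is not a description of how the RKGS implements IVM.

```python
class SalesByState:
    """Toy materialized view: total sales per state, maintained from deltas."""

    def __init__(self):
        self.totals = {}  # state -> sum of amounts
        self.counts = {}  # state -> number of contributing rows

    def apply(self, inserted=(), deleted=()):
        # Work is proportional to the size of the change, not of the base data.
        for sign, rows in ((1, inserted), (-1, deleted)):
            for state, amount in rows:
                self.totals[state] = self.totals.get(state, 0) + sign * amount
                self.counts[state] = self.counts.get(state, 0) + sign
                if self.counts[state] == 0:  # no rows left for this state
                    del self.totals[state], self.counts[state]


view = SalesByState()
view.apply(inserted=[("CA", 100), ("NY", 40), ("CA", 25)])
view.apply(inserted=[("NY", 10)], deleted=[("CA", 25)])
```

The second call touches two rows, so it does two updates regardless of how many sales the view already summarizes.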
Cloud-Native Architecture
The RKGS runs as a service in the cloud, and it has been built from the ground up to run cloud-native. In order to use the RKGS, you don’t need to install, configure, or manage hardware or software, and RelationalAI handles all software upgrades, maintenance, and tuning.
This cloud-native architecture also means that it is easy to add more computing power as you need it. In effect, the RKGS scales infinitely, because you can always add more compute.
You can even have multiple engines running on the same data, such that workloads don’t interfere with one another. That is, you can isolate workloads from one another by using separate engines for different workloads.
Because the RKGS uses object storage for data, you can scale knowledge graph data indefinitely at a low cost.
Separating Compute from Storage
The RKGS functions as both a shared-nothing and a shared-disk architecture. This means that while all applications can access all data, the engines that process your data are separate from storage. Unlike with database systems that run in virtual machines, you can shut down engines without losing data, because all data are stored in object storage.
This means that you can add more storage without adding compute resources (engines). Because storage is cheap, you can maintain a lot of data without using more expensive compute to manage it.
You can also increase and decrease compute size as needed. If more people want to run queries and computations over the same data, you can simply add more engines.
In turn, this means that you can give each workload its own appropriately sized engine and run these workloads concurrently. When a workload is finished, you can delete the engine; your data remain safe in object storage.
For example, an initial bulk load for Project A might need a large engine, while a more modest engine might handle its day-to-day updates. Bulk loading for Project B can then be done without impacting Project A’s data feed. Project A’s graph analytics workloads can each run in their own engine, neither impacting nor being impacted by bulk loading.
Conclusion
This guide has given you a sense of the RKGS’s benefits. For more information on the RKGS, the RelationalAI Console, and Rel, see:
- The Quick Start guide.
- The documentation for the RelationalAI Console.
- The documentation for the RAI SDKs.
- The documentation for Rel.