Data Modeling: Graph Normal Form
This concept guide explains how to work with data in Graph Normal Form (GNF).
Introduction
RAI’s Relational Knowledge Graph System (RKGS) allows users to work with data stored in a highly normalized format that RelationalAI calls Graph Normal Form (GNF). This guide describes GNF and explains how to perform common operations with GNF relations.
The next two sections introduce the two requirements for a database to be considered Graph Normal Form:
- Indivisibility of Facts. Each tuple in each relation must correspond to a single, indivisible fact. In other words, each relation must have at most one value column.
- Things Not Strings. The identity of each element in the database must reflect the identity of the concept it represents. In other words, elements in the database should be equal if and only if they are intended to represent the same real-world object.
The remaining sections discuss the advantages of Graph Normal Form and illustrate how to use Rel to work with data in Graph Normal Form.
Indivisibility of Facts
Wide Versus Tall Tables
Suppose that a company wants to record how many nights of hotel accommodation it books for each of its salespersons in each month. One approach is to put the data into a table whose columns correspond to months:
name | Jan | Feb | Mar | Apr |
---|---|---|---|---|
Alice | 3 | 6 | 4 | 9 |
Beatrice | 5 | 2 | 10 | 1 |
Carlos | 4 | 4 | 5 | 3 |
As discussed in Relational Data Modeling, Relational Database Management Systems (RDBMSs) aren’t designed to handle tables in this format.
The problem is that the schema causes the number of columns to grow in an open-ended way, and RDBMSs aren’t designed to accommodate very large numbers of columns or routine changes to the number of columns. They are designed to handle large numbers of rows and frequent row insertions.
One way to solve this problem is to consider the table’s template:
[name] stayed in a hotel [Jan] nights in January, [Feb] nights in February, [Mar] nights in March, [Apr] nights in April, …
Each row in the table provides many facts, so its template is a highly compound and unwieldy. Splitting the template into individual facts makes it much nicer:
[name] stayed in a hotel [nights] nights in January. [name] stayed in a hotel [nights] nights in February. [name] stayed in a hotel [nights] nights in March. [name] stayed in a hotel [nights] nights in April.
This collection of templates can be simplified further by treating the month as another field:
[name] stayed in a hotel [nights] nights in [month].
This three-field template is adequate to capture all of the rows that will ever be added to this table. Furthermore, this template is indivisible. In other words, a fact of the form [name] stayed in a hotel [nights] nights in [month] cannot be subdivided into multiple simpler facts.
This transformation from many columns to three columns is called stacking from wide form to tall form:
name | month | days |
---|---|---|
Alice | Jan | 3 |
Alice | Feb | 6 |
Alice | Mar | 4 |
Alice | Apr | 9 |
Beatrice | Jan | 5 |
Beatrice | Feb | 2 |
Beatrice | Mar | 10 |
Beatrice | Apr | 1 |
Carlos | Jan | 4 |
Carlos | Feb | 4 |
Carlos | Mar | 5 |
Carlos | Apr | 3 |
Some operations, like summing across rows in the wide table, may seem more difficult in the tall table. However, you can get the data from a row in the original table by filtering the rows of the tall table based on the name field. Since RDBMSs are designed to filter rows and sum along columns efficiently — and aren’t set up to sum along rows conveniently — the tall version is actually an improvement over the wide version even in this respect.
Stacking 3NF Tables
The following table of driver’s license data is a typical Third Normal Form (3NF) table:
person | date_of_birth | address | eye_color | sex | height | organ_donor |
---|---|---|---|---|---|---|
Alice | 1996-04-17 | 13 Birch Street | brown | F | 165 cm | true |
Beatrice | 1975-08-11 | 9 Summer Street | brown | F | 162 cm | false |
Carlos | 1981-11-06 | 4 Florence Ave | hazel | M | 170 cm | true |
The table has one key column — person1 — and six value columns.
While the table does have several columns, the number of columns is not poised to increase indefinitely because driver’s license formats seldom change. Traditional RDMBSs are designed to efficiently handle tables like this one.
On the other hand, this table’s predicate is still compound:
[person] was born on [date_of_birth] and lives at [address] and has eye color [eye_color] and is [sex] and has a height of [height] and [ is | is not ] an organ donor.
This complex template may be distilled into a three-field template, following the approach detailed in the Wide Versus Tall Tables section:
The [field] field of [person]‘s driver’s license says [value].
The table for this simplified template looks like this:
field | person | value |
---|---|---|
date_of_birth | Alice | 1996-04-17 |
date_of_birth | Beatrice | 1975-08-11 |
date_of_birth | Carlos | 1981-11-06 |
address | Alice | 13 Birch Street |
address | Beatrice | 9 Summer Street |
address | Carlos | 4 Florence Ave |
eye_color | Alice | brown |
eye_color | Beatrice | brown |
eye_color | Carlos | hazel |
sex | Alice | F |
sex | Beatrice | F |
sex | Carlos | M |
height | Alice | 165 cm |
height | Beatrice | 162 cm |
height | Carlos | 170 cm |
organ_donor | Alice | true |
organ_donor | Beatrice | false |
organ_donor | Carlos | true |
The second and third columns in this table form a collection of six two-column relations, one for each of the six columns in the original 3NF table. The names of these six relations — which are column headers in the original 3NF relations — are indicated in the first column of the stacked table.
Here’s the same table with different styling to make this structure easier to see:
field | person | value |
---|---|---|
Alice | 1996-04-17 | |
date_of_birth | Beatrice | 1975-08-11 |
Carlos | 1981-11-06 | |
Alice | 13 Birch Street | |
address | Beatrice | 9 Summer Street |
Carlos | 4 Florence Ave | |
Alice | brown | |
eye_color | Beatrice | green |
Carlos | hazel | |
Alice | F | |
sex | Beatrice | F |
Carlos | M | |
Alice | 165 cm | |
height | Beatrice | 162 cm |
Carlos | 170 cm | |
Alice | true | |
organ_donor | Beatrice | false |
Carlos | true |
You can represent this relation in Rel using a module:
// read query
module drivers_license
def date_of_birth = {
"Alice", 1996-04-07;
"Beatrice", 1975-08-11;
"Carlos", 1981-11-06;
}
def address = {
"Alice", "13 Birch Street";
"Beatrice", "9 Summer Street";
"Carlos", "4 Florence Ave";
}
def eye_color = {
"Alice", "brown";
"Beatrice", "brown";
"Carlos", "hazel";
}
def sex = {
"Alice", "F";
"Beatrice", "F";
"Carlos", "M";
}
def height = {
"Alice", 165;
"Beatrice", 162;
"Carlos", 170;
}
def organ_donor = {
"Alice", true;
"Beatrice", false;
"Carlos", true;
}
end
This syntax parallels the styled version of the table above since each relation name — date_of_birth
, address
, eye_color
, and so on — appears only once.
Graphs
A table whose columns correspond to field, key, and value may alternatively be visualized as follows:
- Depict the each key or value element as a point.
- For each row, draw a line from the point corresponding to the second element to the point corresponding to the third element.
- Distinguish the lines by color based on the value in the first column:
The points are also called nodes or vertices, and the lines connecting them are called relationships or edges. A collection of nodes and relationships is called a graph.
Because of the close relationship between the stacked relation above and the graph visualization, the stacked relation is called the Graph Normal Form representation of the original 3NF table.
RDF and Labeled Property Graphs
There are two popular ways of organizing graph databases. The one depicted in the graph above is called RDF, which stands for Resource Description Framework. The typical format for a table of RDF triples is subject-predicate-object:
subject | predicate | object |
---|---|---|
Alice | has_date_of_birth | 1996-04-17 |
Beatrice | has_date_of_birth | 1975-08-11 |
Carlos | has_date_of_birth | 1981-11-06 |
Alice | has_address | 13 Birch Street |
Beatrice | has_address | 9 Summer Street |
Carlos | has_address | 4 Florence Ave |
Alice | has_eye_color | brown |
Beatrice | has_eye_color | brown |
Carlos | has_eye_color | hazel |
Alice | has_sex | F |
Beatrice | has_sex | F |
Carlos | has_sex | M |
Alice | has_height | 165 cm |
Beatrice | has_height | 162 cm |
Carlos | has_height | 170 cm |
Alice | is_organ_donor | true |
Beatrice | is_organ_donor | false |
Carlos | is_organ_donor | true |
A collection of RDF triples is called an RDF triple store.
This approach is very simple, but adhering strictly to the three-column format does require substantial refactoring whenever data need to be associated with an existing row. For example, adding information about the date on which Alice moved to 13 Birch Street would require reifying the notion of a move:
subject | predicate | object |
---|---|---|
Alice | made_move | move_1 |
move_1 | is_to_address | 13 Birch Street |
move_1 | occurred_on | 2013 June 01 |
Reification is reasonable in many situations, but having to reify every time you need to add information about an edge can get quite cumbersome, since it introduces further layers of connections that must be navigated when performing queries.
For example, finding Alice’s address in the reified schema requires making the connection through move_1
, whereas in the original schema that information was available in a single row.
An extension to RDF called RDF* (RDF-star) has been proposed to alleviate this problem. RDF* allows triples to be used as a subject or object in other triples. For example, Alice’s move date could be added to the database above by inserting one additional triple:
<< Alice
:has_address
"13 Birch Street" >>
:as_of
2013-06-01
Note that the << >>
syntax is used for grouping in this notation.
A triple that has a triple as a node can, in turn, be used as a subject or object in another triple. For example, suppose Alice waited a month to provide notification of her move. The following triple records the fact that Alice’s move became known as of the first of July:
<< << Alice
:has_address
"13 Birch Street" >>
as_of
2013-06-01 >>
:known_since
2013-07-01
Labeled Property Graphs offer another solution to the problem: They allow you to associate an arbitrary collection of key-value pairs to any node or edge in the graph. These associated key-value pairs are called properties.
One important distinction between the RDF and LPG approaches is that properties in a Labeled Property Graph are auxiliary data, not nodes in the graph. This makes it more difficult to establish any connections that may exist between properties and other nodes or edges in the graph.
Because Rel allows users to construct arbitary relations, its data model is more general than RDF or LPG. Graphs in any of these formats — RDF, RDF*, or LPG — can be modeled in Rel.
Other Kinds of Tables
This section addresses some other situations that arise when dealing with tabular data.
A Compound Key with Multiple Value Columns
The 3NF table in the Stacking 3NF Tables section has one key column and several value columns. Here’s a table with more than one key column:
painting | appraiser | valuation | location |
---|---|---|---|
Mona Lisa | Abbott | $860M | New York |
Mona Lisa | Bartle | $920M | Berlin |
Starry Night | Bartle | $120M | Tokyo |
Sunflowers | Bartle | $45M | London |
The template corresponding to this table is as follows:
[painting] was appraised by [appraiser] for [valuation] in [location].
Assuming that each appraiser appraises each painting at most once, the key of this table consists of the painting and appraiser columns, while valuation and location are value columns.
The template is not indivisible since it can be split into two templates as follows:
[painting] was appraised by [appraiser] for [valuation]. [painting] was appraised by [appraiser] in [location].
There are two ways to transform the data to obtain a table whose template is indivisible:
-
Stack the table to get a relation with tuples of the form
(field, key₁, key₂, value)
:field painting appraiser value valuation Mona Lisa Abbott $860M valuation Mona Lisa Bartle $920M valuation Starry Night Bartle $120M valuation Sunflowers Bartle $45M location Mona Lisa Abbott New York location Mona Lisa Bartle Berlin location Starry Night Bartle Tokyo location Sunflowers Bartle London -
Introduce a second relation that reifies the compound key, which in this case corresponds to the real-world concept of an appraisal. Suppose that the appraisals are identified using the numbers 1, 2, 3, and 4, for example:
field appraisal value painting 1 Mona Lisa painting 2 Mona Lisa painting 3 Starry Night painting 4 Sunflowers appraiser 1 Abbott appraiser 2 Bartle appraiser 3 Bartle appraiser 4 Bartle valuation 1 $860M valuation 2 $920M valuation 3 $120M valuation 4 $45M location 1 New York location 2 Berlin location 3 Tokyo location 4 London
In Rel the second approach is generally preferred because it provides more query flexibility, but either approach may be used.
No Value Columns
Some relations don’t have any value columns. For example, consider the following template and table:
[winning_team] defeated [losing_team] this week
winning_team | losing_team |
---|---|
Liberty | Mystics |
Aces | Sky |
Lynx | Fever |
Storm | Mercury |
Sun | Dream |
Aces | Mercury |
Sky | Fever |
Lynx | Mystics |
Wings | Sparks |
Neither column may be regarded as the table’s key, since a team can defeat multiple opponents or be defeated by multiple opponents. Therefore, the key consists of both columns. By the definition of a key, a table’s template is necessarily indivisible if its key consists of all of its columns.
Summary
The lessons of the preceding sections may be distilled into one main idea:
Ensuring the indivisibility of a relation’s template is the same as ensuring that the table has at most one value column.
For any number greater than 1, a table with value columns can be split into relations with one value column each.
Things Not Strings
The modeling demonstrated in the examples in the Indivisibility of Facts section is subject to a few criticisms:
- The driver’s license table uses each person’s name as a unique identifier for the person. This technique only works if it can be ensured that no two people have the same name, and that is not the case in most practical situations.
- The four appraisals are identified using the surrogate keys (opens in a new tab)
1
,2
,3
, and4
. While this approach works, it isn’t entirely satisfactory because it can be difficult when viewing the contents of tables to recognize that a value like3
refers to an appraisal. - Monetary values like $860M are represented using strings. This makes them difficult to do arithmetic with. Representing them with numbers is not entirely satisfactory either, because information like the currency denomination is lost.
In short, the elements in a database should ideally be represented in a way that encodes their meaning. This idea was popularized in 2012 via the blog post Introducing the Knowledge Graph: Things, Not Strings (opens in a new tab).
Identity
The idea of meaning in a database refers in large part to the ability to establish connections that involve multiple relationships. If you know that Norah Jones’ father is Ravi Shankar and Anoushka Shankar’s father is also Ravi Shankar, then you can say that Norah Jones and Anoushka Shankar are siblings. If two users of a restaurant recommendation site like many of the same restaurants, then you can draw the conclusion that they have a high preference similarity.
A prerequisite for being able to establish connections in a database is the ability to uniquely identify the real-world objects being represented. For example, if you only know that a person named Norah Jones has a father named Ravi Shankar, and similarly for Anoushka Shankar, then you can no longer draw any conclusion about the relationship between Norah and Anoushka.
Two people in conversation may have an understanding that the Ravi Shankar in question is the only one who is a famous sitarist, in which case they can draw the conclusion that Norah and Anoushka are siblings. The database, likewise, should be afforded the ability to determine whether two things are the same.
The RDF solution to this problem is to use agreed-upon universal identifiers called IRIs (Internationalized Resource Identifiers) to distinguish things.
For example, Tim Berners-Lee — the inventor of the World Wide Web — can be identified in any RDF triple store using the string https://www.w3.org/People/Berners-Lee
.
The LPG solution is to supply each node and edge inserted into a database with a unique ID, either using an automatically incremented counter or a randomly generated universally unique ID. With this approach, every node and edge can always be distinguished from any other. On the other hand, this approach doesn’t help with tasks like merging two databases, since the two databases may use different IDs for the same objects.
Things Not Strings in Rel
One key aspect to Rel’s approach to the things-not-strings problem is a pair of constructs called entity types and value types. Instead of storing plain strings or integers in the database, you can store elements that carry additional meaning. Here’s an example showing one way of handling the driver’s license data using entities for the people and value types for the other elements:
// model
def Unit = {:cm; :inch}
value type Distance {(Unit, Number)}
value type EyeColor = {
"amber"; "blue"; "brown"; "gray"; "green"; "hazel"
}
value type Sex {"F"; "M"; "X"}
value type Address { String }
def ^Person { make_entity_hash[:Person, {(String, std::datetime::Date)}] }
In this code cell, the type Distance
is defined so that it consists of a unit — either centimeters or inches — and a number.
The type EyeColor
is defined by explicitly enumerating its six possible values, and similarly for Sex
.
The type Address
is defined in terms of a single string.
The Person
type is different from the others, because a person has an identity which persists despite changes in identifying data.
Therefore, an entity type is used for Person
.
The entity type declaration defines a constructor ^Person
that can be used to construct Person
instances, as illustrated below.
See Entities for more details.
The following code block uses name and birthdate as identifying data to define person entities for Alice, Beatrice, and Carlos. It also creates base relations to map these person entities to both name and date of birth:
// write query
def person_ids = {
"Alice", 1996-04-07;
"Beatrice", 1975-08-11;
"Carlos", 1981-11-06
}
def insert:person:name = {
(^Person[name, dob], name) from name, dob where person_ids(name, dob)
}
def insert:person:dob = {
(^Person[name, dob], dob) from name, dob where person_ids(name, dob)
}
Lastly, we can use the relations above to create a Graph Normal Form version of the table:
// write query
def source:data = """
person,date_of_birth,address,eye_color,sex,height,organ_donor
Alice,1996-04-17,13 Birch Street,brown,F,165 cm,true
Beatrice,1975-08-11,9 Summer Street,brown,F,162 cm,false
Carlos,1981-11-06,4 Florence Ave,hazel,M,170 cm,true
"""
def csv = load_csv[source]
def key_person(key, p) {
csv:person(key, name)
and person:name(p, name)
from name
}
module properties
def address(p, a) {
key_person(key, p)
and csv:address(key, address_string)
and ^Address(address_string, a)
from key, address_string
}
def eye_color(p, c) {
key_person(key, p)
and csv:eye_color(key, eye_color_string)
and ^EyeColor(eye_color_string, c)
from key, eye_color_string
}
def sex(p, s) {
key_person(key, p)
and csv:sex(key, sex_string)
and ^Sex(sex_string, s)
from key, sex_string
}
def height(p, h) {
key_person(key, p)
and csv:height(key, height_string)
and ^Distance(:cm, parse_float[string_split[" ", height_string][1]], h)
from key, height_string
}
def organ_donor(p) {
key_person(key, p)
and csv:organ_donor(key, "true")
from key
}
end
def insert:person = properties
// read query
def output(field, name, value) = {
person:name(p, name)
and person(field, p, value)
from p
}
This approach has several advantages:
- Representing eye color using an
EyeColor
value type ensures that eye color values are not mistaken for something else in the database. For example,hazel
as a person’s name isn’t the same ashazel
the eye color. - Representing length measures in a way that incorporates units helps avoid arithmetic errors involving unit conversions.
- While value types are determined by their associated data, entities in the real world are not.
Changing a person’s name does not make them a different person, for example.
Representing people as entities in the language makes it possible to respect this aspect of real-world identity.
For example, if a person were to change their name, the database should continue to identify that person using the ID that was initially created for them, but the relation
person:name
should be updated to reflect their new name.
Advantages of Graph Normal Form
Graph Normal Form offers several advantages over traditional 3NF tables.
Semantic Stability
Consider the 3NF table discussed in the previous section:
person | date_of_birth | address | eye_color | sex | height | organ_donor |
---|---|---|---|---|---|---|
Alice | 1996-04-17 | 13 Birch Street | brown | F | 165 cm | true |
Beatrice | 1975-08-11 | 9 Summer Street | brown | F | 162 cm | false |
Carlos | 1981-11-06 | 4 Florence Ave | hazel | M | 170 cm | true |
If the Department of Motor Vehicles wishes to keep records on your previous addresses as well as your current address, they can’t store that information in a table like this one. Instead, the standard solution is to remove address from this table and put it in its own table:
person | date_of_birth | eye_color | sex | height | organ_donor |
---|---|---|---|---|---|
Alice | 1996-04-17 | brown | F | 165 cm | true |
Beatrice | 1975-08-11 | brown | F | 162 cm | false |
Carlos | 1981-11-06 | hazel | M | 170 cm | true |
person | address |
---|---|
Alice | 13 Birch Street |
Alice | 1100 Davis Drive |
Beatrice | 9 Summer Street |
Carlos | 4 Florence Ave |
Separating the tables is essential because it is important to avoid having to repeat information for all of the other fields when there are two address rows for the same person.
The new address relationship is called a many-to-many relationship, since each person may live at multiple addresses, and each address may have multiple residents. The original schema is based on the assumption that the relationship is many-to-one. In other words, the assumption was that each person lives at a single address.
Having to make schema changes to turn a one-to-many relationship into a many-to-many relationship can pose significant problems because queries which have been written for the database must be changed to accommodate the new schema. There may be many such queries in use, so updating them all might require quite a lot of work. Because GNF decouples columns as much as possible already, it affords protection against such breaking changes.
Missing Values
One advantage of using Graph Normal Form is that it eliminates the need to store missing values. Suppose, for example, that Beatrice’s date of birth is unknown.2 This might be indicated by leaving a blank in that position in the table:
person | date_of_birth | address | eye_color | sex | height | organ_donor |
---|---|---|---|---|---|---|
Alice | 1996-04-17 | 13 Birch Street | brown | F | 165 cm | true |
Beatrice | 9 Summer Street | brown | F | 162 cm | false | |
Carlos | 1981-11-06 | 4 Florence Ave | hazel | M | 170 cm | true |
In Graph Normal Form, there is no need to do that, since the corresponding date-of-birth row can be dropped:
field | person | value |
---|---|---|
date_of_birth | Alice | 1996-04-17 |
date_of_birth | Carlos | 1981-11-06 |
address | Alice | 13 Birch Street |
address | Beatrice | 9 Summer Street |
address | Carlos | 4 Florence Ave |
eye_color | Alice | brown |
eye_color | Beatrice | brown |
eye_color | Carlos | hazel |
⋮ | ⋮ | ⋮ |
Since many errors arise from the often surprising behavior of the three-valued logic (opens in a new tab) that some relational database management systems must implement because of their support for null values, which represent empty cells, it’s easier to write transparent and predictable queries for data in Graph Normal Form.
Edge Metadata
As discussed in the section on RDF and Labeled Property Graphs, it is often helpful to record auxiliary information about specific facts in a database. For example, you might want to store a date on which the fact became known, a source for the fact, or which language is used to express the fact.
If such metadata need to be recorded for several different facts stored in the same 3NF table, it is quite inconvenient to do that. You would need to create several new columns — one for each column that has some associated metadata — and you would also have to keep track of which auxiliary columns apply to which facts. Relational database management systems are not designed to handle that kind of schema information.
The graph database systems discussed in the section on RDF and Labeled Property Graphs offer better solutions to this problem, but those approaches also have drawbacks. For example, querying an RDF* triple store is more challenging due to the nested nature of the edges. Properties in an LPG are useful for associating metadata with edges in the graph, but because properties are not themselves nodes or edges in the graph, it is not possible to associate metadata with a specific property.
In a GNF collection of relations, recording auxiliary data for each fact is more straightforward since individual facts correspond to individual tuples in a relation. This makes it possible to record metadata about a particular class of facts using an additional column in the corresponding relation. Such a relation is still in Graph Normal Form as long as the metadata column is part of the key. If the metadata column is a value column, you can instead put it in a separate relation.
Conclusion
In summary, Graph Normal Form is a normalized format for relational data which also generalizes traditional graph database formats. A collection of relations is in Graph Normal Form if the following conditions are met:
- Normalization. Every relation has at most one non-key column.
- Things not strings. Elements in the database are equal if and only if they are meant to refer to the same thing in the real world.
Representing data in Graph Normal Form provides several advantages relative to 3NF tables and other graph data formats. It encourages the expression of reasoning in logical terms using the fundamental relationships in the dataset.
Footnotes
-
In a real application, you would use the driver’s license number as a key, since the number uniquely identifies a person. For convenience, this example uses the person’s name instead. This point is discussed more extensively later. ↩
-
Beatrice’s missing birthday might seem like a problem for the construction of an entity ID for her. However, it is possible to define an additional constructor which just uses her name:
// write query entity type Person = String def insert:person:name(p, "Beatrice") = ^Person("Beatrice", p)
As long as you check that the ID being generated for a person is not already in use for another person in the database, you can use fewer fields or more fields as necessary to ensure that unique IDs are generated for distinct entities in the database. You can use an integrity constraint to perform this check. ↩