Data Modeling: Graph Normal Form

This concept guide explains how to work with data in Graph Normal Form (GNF).

Introduction

RAI’s Relational Knowledge Graph System (RKGS) allows users to work with data stored in a highly normalized format that RelationalAI calls Graph Normal Form (GNF). This guide describes GNF and explains how to perform common operations with GNF relations.

The next two sections introduce the two requirements for a database to be considered Graph Normal Form:

Indivisibility of Facts. Each tuple in each relation must correspond to a single, indivisible fact. In other words, each relation must have at most one value column.
Things Not Strings. The identity of each element in the database must reflect the identity of the concept it represents. In other words, elements in the database should be equal if and only if they are intended to represent the same real-world object.

The remaining sections discuss the advantages of Graph Normal Form and illustrate how to use Rel to work with data in Graph Normal Form.

Indivisibility of Facts

Wide Versus Tall Tables

Suppose that a company wants to record how many nights of hotel accommodation it books for each of its salespersons in each month. One approach is to put the data into a table whose columns correspond to months:

name	Jan	Feb	Mar	Apr
Alice	3	6	4	9
Beatrice	5	2	10	1
Carlos	4	4	5	3

As discussed in Relational Data Modeling, Relational Database Management Systems (RDBMSs) aren’t designed to handle tables in this format.

The problem is that the schema causes the number of columns to grow in an open-ended way, and RDBMSs aren’t designed to accommodate very large numbers of columns or routine changes to the number of columns. They are designed to handle large numbers of rows and frequent row insertions.

One way to solve this problem is to consider the table’s template:

[name] stayed in a hotel [Jan] nights in January, [Feb] nights in February, [Mar] nights in March, [Apr] nights in April, …

Each row in the table provides many facts, so its template is a highly compound and unwieldy. Splitting the template into individual facts makes it much nicer:

[name] stayed in a hotel [nights] nights in January. [name] stayed in a hotel [nights] nights in February. [name] stayed in a hotel [nights] nights in March. [name] stayed in a hotel [nights] nights in April.

This collection of templates can be simplified further by treating the month as another field:

[name] stayed in a hotel [nights] nights in [month].

This three-field template is adequate to capture all of the rows that will ever be added to this table. Furthermore, this template is indivisible. In other words, a fact of the form [name] stayed in a hotel [nights] nights in [month] cannot be subdivided into multiple simpler facts.

This transformation from many columns to three columns is called stacking from wide form to tall form:

name	month	days
Alice	Jan	3
Alice	Feb	6
Alice	Mar	4
Alice	Apr	9
Beatrice	Jan	5
Beatrice	Feb	2
Beatrice	Mar	10
Beatrice	Apr	1
Carlos	Jan	4
Carlos	Feb	4
Carlos	Mar	5
Carlos	Apr	3

Some operations, like summing across rows in the wide table, may seem more difficult in the tall table. However, you can get the data from a row in the original table by filtering the rows of the tall table based on the name field. Since RDBMSs are designed to filter rows and sum along columns efficiently — and aren’t set up to sum along rows conveniently — the tall version is actually an improvement over the wide version even in this respect.

Stacking 3NF Tables

The following table of driver’s license data is a typical Third Normal Form (3NF) table:

person	date_of_birth	address	eye_color	sex	height	organ_donor
Alice	1996-04-17	13 Birch Street	brown	F	165 cm	true
Beatrice	1975-08-11	9 Summer Street	brown	F	162 cm	false
Carlos	1981-11-06	4 Florence Ave	hazel	M	170 cm	true

The table has one key column — person¹ — and six value columns.

While the table does have several columns, the number of columns is not poised to increase indefinitely because driver’s license formats seldom change. Traditional RDMBSs are designed to efficiently handle tables like this one.

On the other hand, this table’s predicate is still compound:

[person] was born on [date_of_birth] and lives at [address] and has eye color [eye_color] and is [sex] and has a height of [height] and [ is | is not ] an organ donor.

This complex template may be distilled into a three-field template, following the approach detailed in the Wide Versus Tall Tables section:

The [field] field of [person]‘s driver’s license says [value].

The table for this simplified template looks like this:

field	person	value
date_of_birth	Alice	1996-04-17
date_of_birth	Beatrice	1975-08-11
date_of_birth	Carlos	1981-11-06
address	Alice	13 Birch Street
address	Beatrice	9 Summer Street
address	Carlos	4 Florence Ave
eye_color	Alice	brown
eye_color	Beatrice	brown
eye_color	Carlos	hazel
sex	Alice	F
sex	Beatrice	F
sex	Carlos	M
height	Alice	165 cm
height	Beatrice	162 cm
height	Carlos	170 cm
organ_donor	Alice	true
organ_donor	Beatrice	false
organ_donor	Carlos	true

The second and third columns in this table form a collection of six two-column relations, one for each of the six columns in the original 3NF table. The names of these six relations — which are column headers in the original 3NF relations — are indicated in the first column of the stacked table.

Here’s the same table with different styling to make this structure easier to see:

field	person	value
	Alice	1996-04-17
date_of_birth	Beatrice	1975-08-11
	Carlos	1981-11-06

	Alice	13 Birch Street
address	Beatrice	9 Summer Street
	Carlos	4 Florence Ave

	Alice	brown
eye_color	Beatrice	green
	Carlos	hazel

	Alice	F
sex	Beatrice	F
	Carlos	M

	Alice	165 cm
height	Beatrice	162 cm
	Carlos	170 cm

	Alice	true
organ_donor	Beatrice	false
	Carlos	true

You can represent this relation in Rel using a module:

// read query
 
module drivers_license
    def date_of_birth = {
        "Alice", 1996-04-07;
        "Beatrice", 1975-08-11;
        "Carlos", 1981-11-06;
    }
    def address = {
        "Alice", "13 Birch Street";
        "Beatrice", "9 Summer Street";
        "Carlos", "4 Florence Ave";
    }
    def eye_color = {
        "Alice", "brown";
        "Beatrice", "brown";
        "Carlos", "hazel";
    }
    def sex = {
        "Alice", "F";
        "Beatrice", "F";
        "Carlos", "M";
    }
    def height = {
        "Alice", 165;
        "Beatrice", 162;
        "Carlos", 170;
    }
    def organ_donor = {
        "Alice", true;
        "Beatrice", false;
        "Carlos", true;
    }
end

This syntax parallels the styled version of the table above since each relation name — date_of_birth, address, eye_color, and so on — appears only once.

Graphs

A table whose columns correspond to field, key, and value may alternatively be visualized as follows:

Depict the each key or value element as a point.
For each row, draw a line from the point corresponding to the second element to the point corresponding to the third element.
Distinguish the lines by color based on the value in the first column:

The points are also called nodes or vertices, and the lines connecting them are called relationships or edges. A collection of nodes and relationships is called a graph.

Because of the close relationship between the stacked relation above and the graph visualization, the stacked relation is called the Graph Normal Form representation of the original 3NF table.

RDF and Labeled Property Graphs

There are two popular ways of organizing graph databases. The one depicted in the graph above is called RDF, which stands for Resource Description Framework. The typical format for a table of RDF triples is subject-predicate-object:

subject	predicate	object
Alice	has_date_of_birth	1996-04-17
Beatrice	has_date_of_birth	1975-08-11
Carlos	has_date_of_birth	1981-11-06
Alice	has_address	13 Birch Street
Beatrice	has_address	9 Summer Street
Carlos	has_address	4 Florence Ave
Alice	has_eye_color	brown
Beatrice	has_eye_color	brown
Carlos	has_eye_color	hazel
Alice	has_sex	F
Beatrice	has_sex	F
Carlos	has_sex	M
Alice	has_height	165 cm
Beatrice	has_height	162 cm
Carlos	has_height	170 cm
Alice	is_organ_donor	true
Beatrice	is_organ_donor	false
Carlos	is_organ_donor	true

A collection of RDF triples is called an RDF triple store.

This approach is very simple, but adhering strictly to the three-column format does require substantial refactoring whenever data need to be associated with an existing row. For example, adding information about the date on which Alice moved to 13 Birch Street would require reifying the notion of a move:

subject	predicate	object
Alice	made_move	move_1
move_1	is_to_address	13 Birch Street
move_1	occurred_on	2013 June 01

Reification is reasonable in many situations, but having to reify every time you need to add information about an edge can get quite cumbersome, since it introduces further layers of connections that must be navigated when performing queries. For example, finding Alice’s address in the reified schema requires making the connection through move_1, whereas in the original schema that information was available in a single row.

An extension to RDF called RDF* (RDF-star) has been proposed to alleviate this problem. RDF* allows triples to be used as a subject or object in other triples. For example, Alice’s move date could be added to the database above by inserting one additional triple:

<< Alice
   :has_address
   "13 Birch Street" >>
:as_of
2013-06-01

Note that the << >> syntax is used for grouping in this notation.

A triple that has a triple as a node can, in turn, be used as a subject or object in another triple. For example, suppose Alice waited a month to provide notification of her move. The following triple records the fact that Alice’s move became known as of the first of July:

<< << Alice
      :has_address
      "13 Birch Street" >>
   as_of
   2013-06-01 >>
:known_since
2013-07-01

Labeled Property Graphs offer another solution to the problem: They allow you to associate an arbitrary collection of key-value pairs to any node or edge in the graph. These associated key-value pairs are called properties.

One important distinction between the RDF and LPG approaches is that properties in a Labeled Property Graph are auxiliary data, not nodes in the graph. This makes it more difficult to establish any connections that may exist between properties and other nodes or edges in the graph.

Because Rel allows users to construct arbitary relations, its data model is more general than RDF or LPG. Graphs in any of these formats — RDF, RDF*, or LPG — can be modeled in Rel.

Other Kinds of Tables

This section addresses some other situations that arise when dealing with tabular data.

A Compound Key with Multiple Value Columns

The 3NF table in the Stacking 3NF Tables section has one key column and several value columns. Here’s a table with more than one key column:

painting	appraiser	valuation	location
Mona Lisa	Abbott	$860M	New York
Mona Lisa	Bartle	$920M	Berlin
Starry Night	Bartle	$120M	Tokyo
Sunflowers	Bartle	$45M	London

The template corresponding to this table is as follows:

[painting] was appraised by [appraiser] for [valuation] in [location].

Assuming that each appraiser appraises each painting at most once, the key of this table consists of the painting and appraiser columns, while valuation and location are value columns.

The template is not indivisible since it can be split into two templates as follows:

[painting] was appraised by [appraiser] for [valuation]. [painting] was appraised by [appraiser] in [location].

There are two ways to transform the data to obtain a table whose template is indivisible:

Stack the table to get a relation with tuples of the form (field, key₁, key₂, value):

field	painting	appraiser	value
valuation	Mona Lisa	Abbott	$860M
valuation	Mona Lisa	Bartle	$920M
valuation	Starry Night	Bartle	$120M
valuation	Sunflowers	Bartle	$45M
location	Mona Lisa	Abbott	New York
location	Mona Lisa	Bartle	Berlin
location	Starry Night	Bartle	Tokyo
location	Sunflowers	Bartle	London

Introduce a second relation that reifies the compound key, which in this case corresponds to the real-world concept of an appraisal. Suppose that the appraisals are identified using the numbers 1, 2, 3, and 4, for example:

field	appraisal	value
painting	1	Mona Lisa
painting	2	Mona Lisa
painting	3	Starry Night
painting	4	Sunflowers
appraiser	1	Abbott
appraiser	2	Bartle
appraiser	3	Bartle
appraiser	4	Bartle
valuation	1	$860M
valuation	2	$920M
valuation	3	$120M
valuation	4	$45M
location	1	New York
location	2	Berlin
location	3	Tokyo
location	4	London

In Rel the second approach is generally preferred because it provides more query flexibility, but either approach may be used.

No Value Columns

Some relations don’t have any value columns. For example, consider the following template and table:

[winning_team] defeated [losing_team] this week

winning_team	losing_team
Liberty	Mystics
Aces	Sky
Lynx	Fever
Storm	Mercury
Sun	Dream
Aces	Mercury
Sky	Fever
Lynx	Mystics
Wings	Sparks

Neither column may be regarded as the table’s key, since a team can defeat multiple opponents or be defeated by multiple opponents. Therefore, the key consists of both columns. By the definition of a key, a table’s template is necessarily indivisible if its key consists of all of its columns.

Summary

The lessons of the preceding sections may be distilled into one main idea:

🔎

Ensuring the indivisibility of a relation’s template is the same as ensuring that the table has at most one value column.

For any number $n$ greater than 1, a table with $n$ value columns can be split into $n$ relations with one value column each.

Things Not Strings

The modeling demonstrated in the examples in the Indivisibility of Facts section is subject to a few criticisms:

The driver’s license table uses each person’s name as a unique identifier for the person. This technique only works if it can be ensured that no two people have the same name, and that is not the case in most practical situations.
The four appraisals are identified using the surrogate keys (opens in a new tab) 1, 2, 3, and 4. While this approach works, it isn’t entirely satisfactory because it can be difficult when viewing the contents of tables to recognize that a value like 3 refers to an appraisal.
Monetary values like $860M are represented using strings. This makes them difficult to do arithmetic with. Representing them with numbers is not entirely satisfactory either, because information like the currency denomination is lost.

In short, the elements in a database should ideally be represented in a way that encodes their meaning. This idea was popularized in 2012 via the blog post Introducing the Knowledge Graph: Things, Not Strings (opens in a new tab).

Identity

The idea of meaning in a database refers in large part to the ability to establish connections that involve multiple relationships. If you know that Norah Jones’ father is Ravi Shankar and Anoushka Shankar’s father is also Ravi Shankar, then you can say that Norah Jones and Anoushka Shankar are siblings. If two users of a restaurant recommendation site like many of the same restaurants, then you can draw the conclusion that they have a high preference similarity.

A prerequisite for being able to establish connections in a database is the ability to uniquely identify the real-world objects being represented. For example, if you only know that a person named Norah Jones has a father named Ravi Shankar, and similarly for Anoushka Shankar, then you can no longer draw any conclusion about the relationship between Norah and Anoushka.

Two people in conversation may have an understanding that the Ravi Shankar in question is the only one who is a famous sitarist, in which case they can draw the conclusion that Norah and Anoushka are siblings. The database, likewise, should be afforded the ability to determine whether two things are the same.

The RDF solution to this problem is to use agreed-upon universal identifiers called IRIs (Internationalized Resource Identifiers) to distinguish things. For example, Tim Berners-Lee — the inventor of the World Wide Web — can be identified in any RDF triple store using the string https://www.w3.org/People/Berners-Lee.

The LPG solution is to supply each node and edge inserted into a database with a unique ID, either using an automatically incremented counter or a randomly generated universally unique ID. With this approach, every node and edge can always be distinguished from any other. On the other hand, this approach doesn’t help with tasks like merging two databases, since the two databases may use different IDs for the same objects.

Things Not Strings in Rel

One key aspect to Rel’s approach to the things-not-strings problem is a pair of constructs called entity types and value types. Instead of storing plain strings or integers in the database, you can store elements that carry additional meaning. Here’s an example showing one way of handling the driver’s license data using entities for the people and value types for the other elements:

// model
 
def Unit = {:cm; :inch}
value type Distance {(Unit, Number)}
 
value type EyeColor = {
  "amber"; "blue"; "brown"; "gray"; "green"; "hazel"
}
 
value type Sex {"F"; "M"; "X"}
 
value type Address { String }
 
def ^Person { make_entity_hash[:Person,  {(String, std::datetime::Date)}] }

In this code cell, the type Distance is defined so that it consists of a unit — either centimeters or inches — and a number. The type EyeColor is defined by explicitly enumerating its six possible values, and similarly for Sex. The type Address is defined in terms of a single string.

The Person type is different from the others, because a person has an identity which persists despite changes in identifying data. Therefore, an entity type is used for Person. The entity type declaration defines a constructor ^Person that can be used to construct Person instances, as illustrated below. See Entities for more details.

The following code block uses name and birthdate as identifying data to define person entities for Alice, Beatrice, and Carlos. It also creates base relations to map these person entities to both name and date of birth:

// write query
 
def person_ids = {
  "Alice", 1996-04-07;
  "Beatrice", 1975-08-11;
  "Carlos", 1981-11-06
}
 
def insert:person:name = {
  (^Person[name, dob], name) from name, dob where person_ids(name, dob)
}
 
def insert:person:dob = {
  (^Person[name, dob], dob) from name, dob where person_ids(name, dob)
}

Lastly, we can use the relations above to create a Graph Normal Form version of the table:

// write query
 
def source:data = """
person,date_of_birth,address,eye_color,sex,height,organ_donor
Alice,1996-04-17,13 Birch Street,brown,F,165 cm,true
Beatrice,1975-08-11,9 Summer Street,brown,F,162 cm,false
Carlos,1981-11-06,4 Florence Ave,hazel,M,170 cm,true
"""
 
def csv = load_csv[source]
 
def key_person(key, p) {
    csv:person(key, name)
    and person:name(p, name)
    from name
}
 
module properties
    def address(p, a) {
        key_person(key, p)
        and csv:address(key, address_string)
        and ^Address(address_string, a)
        from key, address_string
    }
 
    def eye_color(p, c) {
        key_person(key, p)
        and csv:eye_color(key, eye_color_string)
        and ^EyeColor(eye_color_string, c)
        from key, eye_color_string
    }
 
    def sex(p, s) {
        key_person(key, p)
        and csv:sex(key, sex_string)
        and ^Sex(sex_string, s)
        from key, sex_string
    }
 
    def height(p, h) {
        key_person(key, p)
        and csv:height(key, height_string)
        and ^Distance(:cm, parse_float[string_split[" ", height_string][1]], h)
        from key, height_string
    }
 
    def organ_donor(p) {
        key_person(key, p)
        and csv:organ_donor(key, "true")
        from key
    }
end
 
def insert:person = properties

// read query
 
def output(field, name, value) = {
    person:name(p, name)
    and person(field, p, value)
    from p
}

This approach has several advantages:

Representing eye color using an EyeColor value type ensures that eye color values are not mistaken for something else in the database. For example, hazel as a person’s name isn’t the same as hazel the eye color.
Representing length measures in a way that incorporates units helps avoid arithmetic errors involving unit conversions.
While value types are determined by their associated data, entities in the real world are not. Changing a person’s name does not make them a different person, for example. Representing people as entities in the language makes it possible to respect this aspect of real-world identity. For example, if a person were to change their name, the database should continue to identify that person using the ID that was initially created for them, but the relation person:name should be updated to reflect their new name.

Advantages of Graph Normal Form

Graph Normal Form offers several advantages over traditional 3NF tables.

Semantic Stability

Consider the 3NF table discussed in the previous section:

person	date_of_birth	address	eye_color	sex	height	organ_donor
Alice	1996-04-17	13 Birch Street	brown	F	165 cm	true
Beatrice	1975-08-11	9 Summer Street	brown	F	162 cm	false
Carlos	1981-11-06	4 Florence Ave	hazel	M	170 cm	true

If the Department of Motor Vehicles wishes to keep records on your previous addresses as well as your current address, they can’t store that information in a table like this one. Instead, the standard solution is to remove address from this table and put it in its own table:

person	date_of_birth	eye_color	sex	height	organ_donor
Alice	1996-04-17	brown	F	165 cm	true
Beatrice	1975-08-11	brown	F	162 cm	false
Carlos	1981-11-06	hazel	M	170 cm	true

person	address
Alice	13 Birch Street
Alice	1100 Davis Drive
Beatrice	9 Summer Street
Carlos	4 Florence Ave

Separating the tables is essential because it is important to avoid having to repeat information for all of the other fields when there are two address rows for the same person.

The new address relationship is called a many-to-many relationship, since each person may live at multiple addresses, and each address may have multiple residents. The original schema is based on the assumption that the relationship is many-to-one. In other words, the assumption was that each person lives at a single address.

Having to make schema changes to turn a one-to-many relationship into a many-to-many relationship can pose significant problems because queries which have been written for the database must be changed to accommodate the new schema. There may be many such queries in use, so updating them all might require quite a lot of work. Because GNF decouples columns as much as possible already, it affords protection against such breaking changes.

Missing Values

One advantage of using Graph Normal Form is that it eliminates the need to store missing values. Suppose, for example, that Beatrice’s date of birth is unknown.² This might be indicated by leaving a blank in that position in the table:

person	date_of_birth	address	eye_color	sex	height	organ_donor
Alice	1996-04-17	13 Birch Street	brown	F	165 cm	true
Beatrice		9 Summer Street	brown	F	162 cm	false
Carlos	1981-11-06	4 Florence Ave	hazel	M	170 cm	true

In Graph Normal Form, there is no need to do that, since the corresponding date-of-birth row can be dropped:

field	person	value
date_of_birth	Alice	1996-04-17
date_of_birth	Carlos	1981-11-06
address	Alice	13 Birch Street
address	Beatrice	9 Summer Street
address	Carlos	4 Florence Ave
eye_color	Alice	brown
eye_color	Beatrice	brown
eye_color	Carlos	hazel
⋮	⋮	⋮

Since many errors arise from the often surprising behavior of the three-valued logic (opens in a new tab) that some relational database management systems must implement because of their support for null values, which represent empty cells, it’s easier to write transparent and predictable queries for data in Graph Normal Form.

Edge Metadata

As discussed in the section on RDF and Labeled Property Graphs, it is often helpful to record auxiliary information about specific facts in a database. For example, you might want to store a date on which the fact became known, a source for the fact, or which language is used to express the fact.

If such metadata need to be recorded for several different facts stored in the same 3NF table, it is quite inconvenient to do that. You would need to create several new columns — one for each column that has some associated metadata — and you would also have to keep track of which auxiliary columns apply to which facts. Relational database management systems are not designed to handle that kind of schema information.

The graph database systems discussed in the section on RDF and Labeled Property Graphs offer better solutions to this problem, but those approaches also have drawbacks. For example, querying an RDF* triple store is more challenging due to the nested nature of the edges. Properties in an LPG are useful for associating metadata with edges in the graph, but because properties are not themselves nodes or edges in the graph, it is not possible to associate metadata with a specific property.

In a GNF collection of relations, recording auxiliary data for each fact is more straightforward since individual facts correspond to individual tuples in a relation. This makes it possible to record metadata about a particular class of facts using an additional column in the corresponding relation. Such a relation is still in Graph Normal Form as long as the metadata column is part of the key. If the metadata column is a value column, you can instead put it in a separate relation.

Conclusion

In summary, Graph Normal Form is a normalized format for relational data which also generalizes traditional graph database formats. A collection of relations is in Graph Normal Form if the following conditions are met:

Normalization. Every relation has at most one non-key column.
Things not strings. Elements in the database are equal if and only if they are meant to refer to the same thing in the real world.

Representing data in Graph Normal Form provides several advantages relative to 3NF tables and other graph data formats. It encourages the expression of reasoning in logical terms using the fundamental relationships in the dataset.

Was this doc helpful?

In a real application, you would use the driver’s license number as a key, since the number uniquely identifies a person. For convenience, this example uses the person’s name instead. This point is discussed more extensively later. ↩
Beatrice’s missing birthday might seem like a problem for the construction of an entity ID for her. However, it is possible to define an additional constructor which just uses her name:
```
// write query
 
entity type Person = String
 
def insert:person:name(p, "Beatrice") = ^Person("Beatrice", p)
```
As long as you check that the ID being generated for a person is not already in use for another person in the database, you can use fewer fields or more fields as necessary to ensure that unique IDs are generated for distinct entities in the database. You can use an integrity constraint to perform this check. ↩

Data Modeling: Graph Normal Form

Introduction

Indivisibility of Facts

Wide Versus Tall Tables

Stacking 3NF Tables

Graphs

RDF and Labeled Property Graphs

Other Kinds of Tables

A Compound Key with Multiple Value Columns

No Value Columns

Summary

Things Not Strings

Identity

Things Not Strings in Rel

Advantages of Graph Normal Form

Semantic Stability

Missing Values

Edge Metadata

Conclusion

Footnotes