REL
CONCEPTS
Data Modeling: Graph Normal Form

# Data Modeling: Graph Normal Form

This concept guide explains how to work with data in Graph Normal Form.

## Introduction

RAI’s Relational Knowledge Graph Management System allows users to work with data stored in a highly normalized format that RelationalAI calls Graph Normal Form (GNF). This article describes GNF and explains how to perform common operations with GNF relations.

The next two sections introduce the two requirements for a database to be considered Graph Normal Form:

1. Indivisibility of Facts. Each tuple in each relation must correspond to a single, indivisible fact. In other words, each relation must have at most one value column.
2. Things Not Strings. The identity of each element in the database must reflect the identity of the concept it represents. In other words, elements in the database should be equal if and only if they are intended to represent the same real-world object.

The remaining sections discuss the advantages of Graph Normal Form and illustrate how to use Rel to work with data in Graph Normal Form.

## Indivisibility of Facts

### Wide Versus Tall Tables

Suppose that a company wants to record how many nights of hotel accommodation it books for each of its salespersons in each month. One approach is to put the data into a table whose columns correspond to months:

nameJanFebMarApr
Alice3649
Beatrice52101
Carlos4453

As discussed in the Relational Data Modeling article, Relational Database Management Systems (RDBMSs) aren’t designed to handle tables in this format.

The problem is that the schema causes the number of columns to grow in an open-ended way, and RDBMSs aren’t designed to accommodate very large numbers of columns or routine changes to the number of columns. They are designed to handle large numbers of rows and frequent row insertions.

One way to solve this problem is to consider the table’s template:

[name] stayed in a hotel [Jan] nights in January, [Feb] nights in February, [Mar] nights in March, [Apr] nights in April, …

Each row in the table provides many facts, so its template is a highly compound and unwieldy. Splitting the template into individual facts makes it much nicer:

[name] stayed in a hotel [nights] nights in January. [name] stayed in a hotel [nights] nights in February. [name] stayed in a hotel [nights] nights in March. [name] stayed in a hotel [nights] nights in April.

This collection of templates can be simplified further by treating the month as another field:

[name] stayed in a hotel [nights] nights in [month].

This three-field template is adequate to capture all of the rows that will ever be added to this table. Furthermore, this template is indivisible. In other words, a fact of the form [name] stayed in a hotel [nights] nights in [month] cannot be subdivided into multiple simpler facts.

This transformation from many columns to three columns is called stacking from wide form to tall form. Press the next button three times to visualize this transformation:

Some operations, like summing across rows in the wide table, may seem more difficult in the tall table. However, you can get the data from a row in the original table by filtering the rows of the tall table based on the name field. Since RDBMSs are designed to filter rows and sum along columns efficiently — and aren’t set up to sum along rows conveniently — the tall version is actually an improvement over the wide version even in this respect.

### Stacking 3NF Tables

The following table of driver’s license data is a typical Third Normal Form (3NF) table:

Alice1996-04-1713 Birch StreetbrownF165 cmtrue
Beatrice1975-08-119 Summer StreetbrownF162 cmfalse
Carlos1981-11-064 Florence AvehazelM170 cmtrue

The table has one key column — person1 — and six value columns.

While the table does have several columns, the number of columns is not poised to increase indefinitely because driver’s license formats seldom change. Traditional RDMBSs are designed to efficiently handle tables like this one.

On the other hand, this table’s predicate is still compound:

[person] was born on [date_of_birth] and lives at [address] and has eye color [eye_color] and is [sex] and has a height of [height] and [ is | is not ] an organ donor.

This complex template may be distilled into a three-field template, following the approach detailed in the Wide Versus Tall Tables section:

The [field] field of [person]‘s driver’s license says [value].

The table for this simplified template looks like this:

fieldpersonvalue
date_of_birthAlice1996-04-17
date_of_birthBeatrice1975-08-11
date_of_birthCarlos1981-11-06
eye_colorAlicebrown
eye_colorBeatricebrown
eye_colorCarloshazel
sexAliceF
sexBeatriceF
sexCarlosM
heightAlice165 cm
heightBeatrice162 cm
heightCarlos170 cm
organ_donorAlicetrue
organ_donorBeatricefalse
organ_donorCarlostrue

The second and third columns in this table form a collection of six two-column relations, one for each of the six columns in the original 3NF table. The names of these six relations — which are column headers in the original 3NF relations — are indicated in the first column of the stacked table.

Here’s the same table with different styling to make this structure easier to see:

This relation may be represented in Rel using a module:

// read query

def date_of_birth = {
"Alice", 1996-04-07;
"Beatrice", 1975-08-11;
"Carlos", 1981-11-06;
}
"Alice", "13 Birch Street";
"Beatrice", "9 Summer Street";
"Carlos", "4 Florence Ave";
}
def eye_color = {
"Alice", "brown";
"Beatrice", "brown";
"Carlos", "hazel";
}
def sex = {
"Alice", "F";
"Beatrice", "F";
"Carlos", "M";
}
def height = {
"Alice", 165;
"Beatrice", 162;
"Carlos", 170;
}
def organ_donor = {
"Alice", true;
"Beatrice", false;
"Carlos", true;
}
end

This syntax parallels the styled version of the table above since each relation name — date_of_birth, address, eye_color, and so on — appears only once.

### Graphs

A table whose columns correspond to field, key, and value may alternatively be visualized as follows:

1. Depict the each key or value element as a point.
2. For each row, draw a line from the point corresponding to the second element to the point corresponding to the third element.
3. Distinguish the lines by color based on the value in the first column:

The points are also called nodes or vertices, and the lines connecting them are called relationships or edges. A collection of nodes and relationships is called a graph.

Because of the close relationship between the stacked relation above and the graph visualization, the stacked relation is called the Graph Normal Form representation of the original 3NF table.

#### RDF and Labeled Property Graphs

There are two popular ways of organizing graph databases. The one depicted in the graph above is called RDF, which stands for Resource Description Framework. The typical format for a table of RDF triples is subject-predicate-object:

subjectpredicateobject
Alicehas_date_of_birth1996-04-17
Beatricehas_date_of_birth1975-08-11
Carloshas_date_of_birth1981-11-06
Alicehas_eye_colorbrown
Beatricehas_eye_colorbrown
Carloshas_eye_colorhazel
Alicehas_sexF
Beatricehas_sexF
Carloshas_sexM
Alicehas_height165 cm
Beatricehas_height162 cm
Carloshas_height170 cm
Aliceis_organ_donortrue
Beatriceis_organ_donorfalse
Carlosis_organ_donortrue

A collection of RDF triples is called an RDF triple store.

This approach is very simple, but adhering strictly to the three-column format does require substantial refactoring whenever data need to be associated with an existing row. For example, adding information about the date on which Alice moved to 13 Birch Street would require reifying the notion of a move:

subjectpredicateobject
move_1occurred_on2013 June 01

Reification is reasonable in many situations, but having to reify every time you need to add information about an edge can get quite cumbersome, since it introduces further layers of connections that must be navigated when performing queries. For example, finding Alice’s address in the reified schema requires making the connection through move_1, whereas in the original schema that information was available in a single row.

An extension to RDF called RDF* (RDF-star) has been proposed to alleviate this problem. RDF* allows triples to be used as a subject or object in other triples. For example, Alice’s move date could be added to the database above by inserting one additional triple:

<< Alice
"13 Birch Street" >>
:as_of
2013-06-01

Note that the << >> syntax is used for grouping in this notation.

A triple that has a triple as a node can, in turn, be used as a subject or object in another triple. For example, suppose Alice waited a month to provide notification of her move. The following triple records the fact that Alice’s move became known as of the first of July:

<< << Alice
"13 Birch Street" >>
as_of
2013-06-01 >>
:known_since
2013-07-01

Labeled Property Graphs offer another solution to the problem: They allow you to associate an arbitrary collection of key-value pairs to any node or edge in the graph. These associated key-value pairs are called properties.

One important distinction between the RDF and LPG approaches is that properties in a Labeled Property Graph are auxiliary data, not nodes in the graph. This makes it more difficult to establish any connections that may exist between properties and other nodes or edges in the graph.

Because Rel allows users to construct arbitary relations, its data model is more general than RDF or LPG. Graphs in any of these formats — RDF, RDF*, or LPG — can be modeled in Rel.

### Other Kinds of Tables

This section addresses some other situations which arise when dealing with tabular data.

#### A Compound Key with Multiple Value Columns

The 3NF table in the Stacking 3NF Tables section has one key column and several value columns. Here’s a table with more than one key column:

paintingappraiservaluationlocation
Mona LisaAbbott$860MNew York Mona LisaBartle$920MBerlin
Starry NightBartle$120MTokyo SunflowersBartle$45MLondon

The template corresponding to this table is as follows:

[painting] was appraised by [appraiser] for [valuation] in [location].

Assuming that each appraiser appraises each painting at most once, the key of this table consists of the painting and appraiser columns, while valuation and location are value columns.

The template is not indivisible since it can be split into two templates as follows:

[painting] was appraised by [appraiser] for [valuation]. [painting] was appraised by [appraiser] in [location].

There are two ways to transform the data to obtain a table whose template is indivisible:

1. Stack the table to get a relation with tuples of the form (field, key₁, key₂, value):

fieldpaintingappraiservalue
valuationMona LisaAbbott$860M valuationMona LisaBartle$920M
valuationStarry NightBartle$120M valuationSunflowersBartle$45M
locationMona LisaAbbottNew York
locationMona LisaBartleBerlin
locationStarry NightBartleTokyo
locationSunflowersBartleLondon
2. Introduce a second relation that reifies the compound key, which in this case corresponds to the real-world concept of an appraisal. Suppose that the appraisals are identified using the numbers 1, 2, 3, and 4, for example:

fieldappraisalvalue
painting1Mona Lisa
painting2Mona Lisa
painting3Starry Night
painting4Sunflowers
appraiser1Abbott
appraiser2Bartle
appraiser3Bartle
appraiser4Bartle
valuation1$860M valuation2$920M
valuation3$120M valuation4$45M
location1New York
location2Berlin
location3Tokyo
location4London

In Rel the second approach is generally preferred because it provides more query flexibility, but either approach may be used.

#### No Value Columns

Some relations don’t have any value columns. For example, consider the following template and table:

[winning_team] defeated [losing_team] this week

winning_teamlosing_team
LibertyMystics
AcesSky
LynxFever
StormMercury
SunDream
AcesMercury
SkyFever
LynxMystics
WingsSparks

Neither column may be regarded as the table’s key, since a team can defeat multiple opponents or be defeated by multiple opponents. Therefore, the key consists of both columns. By the definition of a key, a table’s template is necessarily indivisible if its key consists of all of its columns.

#### Summary

The lessons of the preceding sections may be distilled into one main idea:

🔎

Ensuring the indivisibility of a relation’s template is the same as ensuring that the table has at most one value column.

For any number $n$ greater than 1, a table with $n$ value columns can be split into $n$ relations with one value column each.

## Things Not Strings

The modeling demonstrated in the examples in the Indivisibility of Facts section is subject to a few criticisms:

1. The driver’s license table uses each person’s name as a unique identifier for the person. This technique only works if it can be ensured that no two people have the same name, and that is not the case in most practical situations.
2. The four appraisals are identified using the surrogate keys (opens in a new tab) 1, 2, 3, and 4. While this approach works, it isn’t entirely satisfactory because it can be difficult when viewing the contents of tables to recognize that a value like 3 refers to an appraisal.
3. Monetary values like \$860M are represented using strings. This makes them difficult to do arithmetic with. Representing them with numbers is not entirely satisfactory either, because information like the currency denomination is lost.

In short, the elements in a database should ideally be represented in a way that encodes their meaning. This idea was popularized in 2012 via the blog post Introducing the Knowledge Graph: Things, Not Strings (opens in a new tab).

### Identity

The idea of meaning in a database refers in large part to the ability to establish connections that involve multiple relationships. If you know that Norah Jones’ father is Ravi Shankar and Anoushka Shankar’s father is also Ravi Shankar, then you can say that Norah Jones and Anoushka Shankar are siblings. If two users of a restaurant recommendation site like many of the same restaurants, then you can draw the conclusion that they have a high preference similarity.

A prerequisite for being able to establish connections in a database is the ability to uniquely identify the real-world objects being represented. For example, if you only know that a person named Norah Jones has a father named Ravi Shankar, and similarly for Anoushka Shankar, then you can no longer draw any conclusion about the relationship between Norah and Anoushka.

Two people in conversation may have an understanding that the Ravi Shankar in question is the only one who is a famous sitarist, in which case they can draw the conclusion that Norah and Anoushka are siblings. The database, likewise, should be afforded the ability to determine whether two things are the same.

The RDF solution to this problem is to use agreed-upon universal identifiers called IRIs (Internationalized Resource Identifiers) to distinguish things. For example, Tim Berners-Lee — the inventor of the World Wide Web — can be identified in any RDF triple store using the string https://www.w3.org/People/Berners-Lee.

The LPG solution is to supply each node and edge inserted into a database with a unique ID, either using an automatically incremented counter or a randomly generated universally unique ID. With this approach, every node and edge can always be distinguished from any other. On the other hand, this approach doesn’t help with tasks like merging two databases, since the two databases may use different IDs for the same objects.

### Things Not Strings in Rel

One key aspect to Rel’s approach to the things-not-strings problem is a pair of constructs called entity types and value types. Instead of storing plain strings or integers in the database, you can store elements that carry additional meaning. Here’s an example showing one way of handling the driver’s license data using entities for the people and value types for the other elements:

// model

def Unit = {:cm; :inch}
value type Distance = Unit, Number

value type EyeColor = {
"amber"; "blue"; "brown"; "gray"; "green"; "hazel"
}

value type Sex = {"F"; "M"; "X"}

entity type Person = String, Date

In this code cell, the type Distance is defined so that it consists of a unit — either centimeters or inches — and a number. The type EyeColor is defined by explicitly enumerating its six possible values, and similarly for Sex. The type Address is defined in terms of a single string.

The Person type is different from the others, because a person has an identity which persists despite changes in identifying data. Therefore, an entity type is used for Person. The entity type declaration defines a constructor ^Person that can be used to construct Person instances, as illustrated below. See the Entities Concept Guide for more details.

The following code block uses name and birthdate as identifying data to define person entities for Alice, Beatrice, and Carlos. It also creates base relations to map these person entities to both name and date of birth:

// write query

def person_ids = {
"Alice", 1996-04-07;
"Beatrice", 1975-08-11;
"Carlos", 1981-11-06
}

def insert:person:name = {
(^Person[name, dob], name) from name, dob where person_ids(name, dob)
}

def insert:person:dob = {
(^Person[name, dob], dob) from name, dob where person_ids(name, dob)
}

Lastly, we can use the relations above to create a Graph Normal Form version of the table:

// write query

def source:data = """
Alice,1996-04-17,13 Birch Street,brown,F,165 cm,true
Beatrice,1975-08-11,9 Summer Street,brown,F,162 cm,false
Carlos,1981-11-06,4 Florence Ave,hazel,M,170 cm,true
"""

def key_person(key, p) {
csv:person(key, name)
and person:name(p, name)
from name
}

module properties
key_person(key, p)
}

def eye_color(p, c) {
key_person(key, p)
and csv:eye_color(key, eye_color_string)
and ^EyeColor(eye_color_string, c)
from key, eye_color_string
}

def sex(p, s) {
key_person(key, p)
and csv:sex(key, sex_string)
and ^Sex(sex_string, s)
from key, sex_string
}

def height(p, h) {
key_person(key, p)
and csv:height(key, height_string)
and ^Distance(:cm, parse_float[string_split[" ", height_string][1]], h)
from key, height_string
}

def organ_donor(p) {
key_person(key, p)
and csv:organ_donor(key, "true")
from key
}
end

def insert:person = properties
// read query

def output(field, name, value) = {
person:name(p, name)
and person(field, p, value)
from p
}

1. Representing eye color using an EyeColor value type ensures that eye color values are not mistaken for something else in the database. For example, hazel as a person’s name isn’t the same as hazel the eye color.
2. Representing length measures in a way that incorporates units helps avoid arithmetic errors involving unit conversions.
3. While value types are determined by their associated data, entities in the real world are not. Changing a person’s name does not make them a different person, for example. Representing people as entities in the language makes it possible to respect this aspect of real-world identity. For example, if a person were to change their name, the database should continue to identify that person using the ID that was initially created for them, but the relation person:name should be updated to reflect their new name.

## Advantages of Graph Normal Form

#### Semantic Stability

Consider the 3NF table discussed in the previous section:

Alice1996-04-1713 Birch StreetbrownF165 cmtrue
Beatrice1975-08-119 Summer StreetbrownF162 cmfalse
Carlos1981-11-064 Florence AvehazelM170 cmtrue

If the Department of Motor Vehicles wishes to keep records on your previous addresses as well as your current address, they can’t store that information in a table like this one. Instead, the standard solution is to remove address from this table and put it in its own table:

persondate_of_birtheye_colorsexheightorgan_donor
Alice1996-04-17brownF165 cmtrue
Beatrice1975-08-11brownF162 cmfalse
Carlos1981-11-06hazelM170 cmtrue
Alice13 Birch Street
Alice1100 Davis Drive
Beatrice9 Summer Street
Carlos4 Florence Ave

Separating the tables is essential because it is important to avoid having to repeat information for all of the other fields when there are two address rows for the same person.

The new address relationship is called a many-to-many relationship, since each person may live at multiple addresses, and each address may have multiple residents. The original schema is based on the assumption that the relationship is many-to-one. In other words, the assumption was that each person lives at a single address.

Having to make schema changes to turn a one-to-many relationship into a many-to-many relationship can pose significant problems because queries which have been written for the database must be changed to accommodate the new schema. There may be many such queries in use, so updating them all might require quite a lot of work. Because GNF decouples columns as much as possible already, it affords protection against such breaking changes.

#### Missing Values

One advantage of using Graph Normal Form is that it eliminates the need to store missing values. Suppose, for example, that Beatrice’s date of birth is unknown.2 This might be indicated by leaving a blank in that position in the table:

Alice1996-04-1713 Birch StreetbrownF165 cmtrue
Beatrice9 Summer StreetbrownF162 cmfalse
Carlos1981-11-064 Florence AvehazelM170 cmtrue

In Graph Normal Form, there is no need to do that, since the corresponding date-of-birth row can be dropped:

fieldpersonvalue
date_of_birthAlice1996-04-17
date_of_birthCarlos1981-11-06
eye_colorAlicebrown
eye_colorBeatricebrown
eye_colorCarloshazel

Since many errors arise from the often surprising behavior of the three-valued logic (opens in a new tab) that some relational database management systems must implement because of their support for null values, which represent empty cells, it’s easier to write transparent and predictable queries for data in Graph Normal Form.

As discussed in the section on RDF and Labeled Property Graphs, it is often helpful to record auxiliary information about specific facts in a database. For example, you might want to store a date on which the fact became known, a source for the fact, or which language is used to express the fact.

If such metadata need to be recorded for several different facts stored in the same 3NF table, it is quite inconvenient to do that. You would need to create several new columns — one for each column that has some associated metadata — and you would also have to keep track of which auxiliary columns apply to which facts. Relational database management systems are not designed to handle that kind of schema information.

The Graph database systems discussed in the section on RDF and Labeled Property Graphs offer better solutions to this problem, but those approaches also have drawbacks. For example, querying an RDF* triple store is more challenging due to the nested nature of the edges. Properties in an LPG are useful for associating metadata with edges in the graph, but because properties are not themselves nodes or edges in the graph, it is not possible to associate metadata with a specific property.

In a GNF collection of relations, recording auxiliary data for each fact is more straightforward since individual facts correspond to individual tuples in a relation. This makes it possible to record metadata about a particular class of facts using an additional column in the corresponding relation. Such a relation is still in Graph Normal Form as long as the metadata column is part of the key. If the metadata column is a value column, you can instead put it in a separate relation.

## Conclusion

In summary, Graph Normal Form is a normalized format for relational data which also generalizes traditional graph database formats. A collection of relations is in Graph Normal Form if the following conditions are met:

1. Normalization. Every relation has at most one non-key column.
2. Things not strings. Elements in the database are equal if and only if they are meant to refer to the same thing in the real world.

Representing data in Graph Normal Form provides several advantages relative to 3NF tables and other graph data formats. It encourages the expression of reasoning in logical terms using the fundamental relationships in the dataset.

## Footnotes

1. In a real application, you would use the driver’s license number as a key, since the number uniquely identifies a person. For convenience, this example uses the person’s name instead. This point is discussed more extensively later in the article.

2. Beatrice’s missing birthday might seem like a problem for the construction of an entity ID for her. However, it is possible to define an additional constructor which just uses her name:

// write query

entity type Person = String

def insert:person:name(p, "Beatrice") = ^Person("Beatrice", p)

As long as you check that the ID being generated for a person is not already in use for another person in the database, you can use fewer fields or more fields as necessary to ensure that unique IDs are generated for distinct entities in the database. You can use an integrity constraint to perform this check.