Relational Data Modeling

Learn how to use relations to model data.

Introduction

Models

Physical Models

A model train is a physical object designed to be similar in some respects to a real train. The similarities between the model and the train are useful for drawing conclusions about the train based on features of the model. For example, an observer may conclude that the train car being represented by this model has several windows:

model train

In general, a physical model consists of three parts:

  1. The original object.
  2. The model object.
  3. A correspondence between some aspects of the original and aspects of the model.

For example, the color of the model, the shapes of its windows, and the letters on the side match corresponding features of the original. Other features of the original, like its size, material composition, and internal engineering details, do not have corresponding elements in the model.

A model train’s value comes from both similarities and differences to the original train. Its simplified internal structure makes it more affordable than a real train, and its small size makes it suitable for display in a home or museum. In general, people who build models aim to achieve an appropriate balance between fidelity and practicality.

Data Models

Real trains and model trains are both made of physical material, but not all models are physical. Many models used in day-to-day life are conceptual models of objects, phenomena, or situations.

For example, suppose you have been invited to a wedding out of town and are looking to book accommodation for the evening. A typical first step would be to collect information about hotels in the area and organize it in a table:

namestarsrategym
Courtyard3$159true
Grand Bohemian4$346true
Residence Inn3$154false
DoubleTree3$122false
Holiday Inn3$128true
Clarion Inn2$77false

While you could say that your table is a model made of ink or pixels, it is more helpful to think of it as presentation of a model made of concepts. Each hotel is represented in your model as a collection which includes a string, a number, a monetary amount, and a true/false value.

Collections, strings, numbers, monetary amounts, true, and false are all concepts rather than physical objects, so this model is indeed conceptual rather than physical. Furthermore, strings, numbers, monetary amounts, true/false values are all data, so this model may be described more specifically as a data model1.

The elements of your data model are similar to actual hotels in only certain respects. Not only is plenty of information missing from the model (parking, laundry, mattress sizes, etc.), but a real hotel is a place where you can actually spend the night. The same would not be true of a data model no matter how much information it represented. Nevertheless, the correspondence between the hotel model and the hotel is rich enough to help you with your search.

In general, a data model comprises three parts:

  1. A real-world object or phenomenon to be modeled.
  2. The conceptual structure of the data being used as a model.
  3. The set of rules defining the correspondence between the values in the model and features of the object or phenomenon being modeled.

Each hotel in the table is represented as a collection of four values. The table itself represents a set of such four-value collections. These aspects of the model constitute its conceptual–or mathematical–structure.

Furthermore, detailed information about the meaning of each column is needed to interpret the values in this table or to add a new row to the table. For example, perhaps the first value is the name of the hotel as it appears on Google Maps, second value is the star rating reported by AAA, and so on. These rules describe how the data elements correspond to things in the real world.

Relation Terminology

Tables and Relations

The table provides a convenient way to visualize the data model you are using for your hotel search, but it has some features which are not essential to the model. For example, the data model would remain the same if you changed cosmetic features like the colors of the cells or the alignment of the text within each cell. Likewise, rearranging the rows would not change the data model, because no meaning has been assigned to the order of the rows. You could even dispense with the table altogether and write out the rows in plain text, perhaps like this:

- name: Courtyard
  stars: 3
  rate: $159
  gym: true
- name: Grand Bohemian
  stars: 4
  rate: $346
  gym: true
- name: Residence Inn
  stars: 3
  rate: $154
  gym: false
- name: DoubleTree
  stars: 3
  rate: $122
  gym: false
- name: Holiday Inn
  stars: 3
  rate: $128
  gym: true
- name: Clarion Inn
  stars: 2
  rate: $77
  gym: false

To help distinguish the essential features of the table from inessential ones, it’s customary to use mathematical vocabulary rather than table-specific vocabulary to describe the conceptual objects that make up the model. The four-element collections representing each hotel are called tuples. A set of tuples is called a relation. In other words, every row in the table represents a tuple, and the entire table represents a relation.

Reordering a table’s rows, for example, constitutes a change to the table, but it does not change the underlying relation. Removing a column from a table would change both the table and the relation it represents.

Templates and Fields

One helpful way to communicate how a relation should be interpreted is to supply a sentence template with blanks to be filled in with the elements of each tuple. For example, the following would be an appropriate template for the relation above:

[hotel] is a [stars]-star hotel with a nightly rate of [rate] dollars, and it [has | does not have] a gym.

The blanks in the template are called fields. Each field may be identified using a name, and the field names may be used as column headers in a tabular representation of the relation. For example, the third blank is the rate field in this relation.

Databases

A typical application involves several different kinds of facts. For example, if you want to investigate each hotel’s reviews in more detail, you may want a relation whose template is:

[hotel] received a [rating]-star review from [user].

The table for this template looks like this:

hotelratinguser
Courtyard5Jennifer Lanza
Courtyard5Taylor Glass
Grand Bohemian5Hannah Price
Residence Inn4Chris Colotti
Residence Inn2Nicholas Brown
Residence Inn3Kristine Fitts
Clarion Inn3Travis Kisner

A collection of one or more relations is called a relational database. The relations' tuples are called the database’s data. Everything else, including relation names, field names, and any other information that may be used to interpret or manage the data, is called the database’s schema.

Only a few rows are shown in the review table above because the actual relation would have thousands of tuples. While you could use pencil-and-paper approach for the original six-row relation, you would need to use software to analyze the much larger review relation. An application designed to robustly store and query relational databases is called a Relational Database Management System (RDBMS). Popular RDBMSs include PostgreSQL, Oracle, Snowflake, and BigQuery.

Normal Forms

Relational database design involves more than arranging just data into tables. For example, spreadsheet applications are built around the idea of creating tables, but their high degree of flexibility means that users must exercise care to ensure that their sheets are well-organized and amenable to computation.

This section introduces some of the most important principles for designing schemas which are conceptually clear and which make it easy for RDBMSs to perform computations effectively.

First Normal Form

Suppose you want to put the ratings into the hotel table. One way to do that is to create a new column called review whose cells store a list of rating values:

namestarsrategymratings
Courtyard3$159true[5, 5]
Grand Bohemian4$346true[5]
Residence Inn3$154false[4, 2, 3]
DoubleTree3$122false[]
Holiday Inn3$128true[]
Clarion Inn2$77false[3]

For simplicity, this table only reflects the rows shown in the reviews table above. If the actual reviews table were used instead, each list would contain hundreds of values.

This design is generally regarded as undesirable or impermissible because the possibility of having a list as an entry makes it more difficult to specify queries and reason about the contents of the table.

For example, with the original two-relation schema, the query “which hotels have at least one two-star review?” is a relation-level operation that can be expressed and computed efficiently with any RDBMS.

With the one-relation schema containing ratings in a list, the query mixes cell-level calculations – searching for a value in a list – with relation-level calculations – identifying tuples which meet a condition. While some RDBMSs, including PostgreSQL, do allow this behavior, it comes at the cost of considerable complexity because database users have to learn different techniques for cell-level and relation-level calculations.

In summary, a database schema should require every element of every tuple to contain a single value, like a date, string, or number. A database whose schema satisfies this property is in First Normal Form (1NF).

Putting a table into 1NF usually requires splitting it into multiple tables. In this example, the ratings column must be separated into its own table. Here’s an example of how that may be done:

namestarsrategym
Courtyard3$159true
Grand Bohemian4$346true
Residence Inn3$154false
DoubleTree3$122false
Holiday Inn3$128true
Clarion Inn2$77false
positionhotelrating
1Courtyard5
2Courtyard5
1Grand Bohemian5
1Residence Inn4
2Residence Inn2
3Residence Inn3
1Clarion Inn3

Note: the second table includes the index of each rating within its array because that approach preserves all of the information in the non-1NF table. Other approaches, like identifying each rating by the user who submitted it, are also fine.

Keys and Values

The definitions for all of the normal forms beyond 1NF make reference to relations' keys and value fields, so this section introduces those terms.

Keys

Each row in the hotels table is uniquely determined by the hotel identifier in the first column because each hotel has a specific star rating, nightly rate, and gym status:

namestarsrategym
Courtyard3$159true
Grand Bohemian4$346true
Residence Inn3$154false
DoubleTree3$122false
Holiday Inn3$128true
Clarion Inn2$77false

A field which can be uniquely determined by another field or set of fields is said to be functionally dependent on that field or set of fields. For example, stars, rate, and gym are functionally dependent on name.

In the review table, a row is uniquely identified by its hotel and user fields, since a user is not allowed to provide two different star ratings for the same hotel:

hotelratinguser
Courtyard5Jennifer Lanza
Courtyard5Taylor Glass
Grand Bohemian5Hannah Price
Residence Inn4Chris Colotti
Residence Inn2Nicholas Brown
Residence Inn3Kristine Fitts
Clarion Inn3Travis Kisner

The hotel field does not uniquely identify a row, since each hotel may have multiple reviews. The hotel and rating field together also do not uniquely identify a row, since multiple users may give the same star rating to the same hotel.

A field or set of fields which can be used to uniquely identify tuples in relation is called a key. For example, the name field is the key of the hotels table, while the key of the reviews table consists of the hotel and user fields.2

It is possible for a relation to have more than one key. For example, consider a table of students with a Student ID column as well as a Social Security Number column. Since either of these two numbers uniquely identifies a student, either column can serve as a key. In such cases, a primary key may be chosen and declared.

Values

The fields which are not part of the key are called value fields. The hotels table has three value fields, and the reviews table has one value field.

It is possible for a relation to have no value fields. For example, consider a relation which records the hotel stays that you have paid for:

datehotel
2019-11-10Hotel Hive
2019-12-22Treehouse Hotel London
2019-12-23Treehouse Hotel London
2022-04-18Hilton Garden Inn Denver

This table’s key includes both fields because it’s possible to pay for two hotel bookings for the same date, and it’s possible to pay for multiple evenings at the same hotel.

Third Normal Form

Suppose that you want to consider the loyalty program points that you can get for your wedding hotel stay. Some of the hotels are part of a hotel chain, and each chain has its own loyalty program:

namechainprogram
CourtyardMarriottBonvoy
Residence InnMarriottBonvoy
DoubleTreeHiltonHonors
Holiday InnIHGIHG

The third column in this table is functionally dependent on the second column. In other words, each row’s program value may be determined by looking at its chain value. This introduces the possibility of data inconsistency.

For example, suppose that Marriott changes its rewards program and that only one of the first two rows in this table is correctly updated. Then the database would be unclear on which rewards program Marriott participates in.

If, instead, the chain-program relationship is split into its own table, there is no such potential for inconsistency:

namechain
CourtyardMarriott
Residence InnMarriott
DoubleTreeHilton
Holiday InnIHG
chainprogram
MarriottBonvoy
HiltonHonors
IHGIHG

If an errant change were made under this schema, the data might be incorrect, but it could not be inconsistent. In other words, each fact encoded by this collection of relations is recorded in only one place.

Not all functional dependencies are a problem. The rating field is functionally dependent on hotel and user in the table below, and this does not introduce any potential for inconsistency:

hotelratinguser
Courtyard5Jennifer Lanza
Courtyard5Taylor Glass
Grand Bohemian5Hannah Price
Residence Inn4Chris Colotti
Residence Inn2Nicholas Brown
Residence Inn3Kristine Fitts
Clarion Inn3Travis Kisner

The difference between this example and the undesirable loyalty-program example is that in this case the dependency is on the relation’s key. The dependency of a value field on the key is necessary.

A collection of tables is said to be in Third Normal Form (3NF) if the only functional dependencies are the ones for which a value field in a relation is functionally dependent on that relation’s key. Any other functional dependency should be avoided:

  1. A functional dependency of a value field on another value field or fields.

    Consider a hotel table with a zip-code field and a state field. The state field is functionally dependent on the zip-code field, and both of them are value fields. Therefore, this table is not in 3NF. To put it in 3NF, remove the state field and create a separate relation to record each zip code’s state.

  2. A functional dependency of a value field on some of the key fields but not all of them.

    Consider a review table with fields hotel, rating, user, and user_review_count, with the user_review_count field recording the total number of reviews that user has made on the site. Since user_review_count can be determined by the user field alone, this table is not in 3NF. To put it in 3NF, remove the user_review_count field and create a separate relation to record each user’s review count.

RDBMSs of all kinds encourage users to model data in 3NF.

Graph Normal Form

RelationalAI goes a step beyond most RDBMSs by encouraging schemas in which each relation has at most one value column. In RelationalAI terminology, such a database is said to be in Graph Normal Form3. A database in Third Normal Form may be put into Graph Normal Form by splitting every table with $n$ value columns into $n$ tables (whenever $n$ is greater than $1$):

def stars = {
("Courtyard", 3);
("Grand Bohemian", 4);
("Residence Inn", 3);
("DoubleTree", 3);
("Holiday Inn", 3);
("Clarion Inn", 2);
}

def rate = {
("Courtyard", 159);
("Grand Bohemian", 346);
("Residence Inn", 154);
("DoubleTree", 122);
("Holiday Inn", 128);
("Clarion Inn", 77);
}

def gym = {
("Courtyard", true);
("Grand Bohemian", true);
("Residence Inn", false);
("DoubleTree", false);
("Holiday Inn", true);
("Clarion Inn", false);
}

You can wrap a collection of commonly keyed relations in a module and display it as a table with more than one value column:

module hotel
def stars = {
("Courtyard", 3);
("Grand Bohemian", 4);
("Residence Inn", 3);
("DoubleTree", 3);
("Holiday Inn", 3);
("Clarion Inn", 2);
}

def rate = {
("Courtyard", 159);
("Grand Bohemian", 346);
("Residence Inn", 154);
("DoubleTree", 122);
("Holiday Inn", 128);
("Clarion Inn", 77);
}

def gym = {
("Courtyard", true);
("Grand Bohemian", true);
("Residence Inn", false);
("DoubleTree", false);
("Holiday Inn", true);
("Clarion Inn", false);
}
end

def output = table[hotel]

Relation: output

gymratestars
Courtyardtrue1593
Grand Bohemiantrue3464
Holiday Inntrue1283
Clarion Inn772
DoubleTree1223
Residence Inn1543

Wide Versus Tall Tables

This section addresses one final principle of relational database design.

Suppose that a company wants to record how many nights of hotel accommodation it books for each of its salespersons in each month. One approach is to put the data into a table whose columns correspond to months:

nameJanFebMarApr
Alice3649
Beatrice52101
Carlos4453

On one hand, this method of organization makes a lot of sense: there are two axes – vertical and horizontal – and effectively two keys – person and month. The table makes full use of the available dimensions by using one key per axis.

However, standard relational database design principles disallow this setup because it causes the number of columns to grow in an open-ended way. Monthly changes to this table’s schema would need to be scheduled, and appropriate changes would need to be made for queries that involve this table.

In other words, it should be possible to make routine changes to data without changing the database’s schema, and this approach violates that principle.

Another approach is to use a tall (or long) table that records the same data with just three columns: name, month, and value. In other words, the tall table’s row axis is used for both names and months. This wide-to-tall transformation is sometimes called stacking.

Press the stack button below to visualize this transformation:

The new data that comes in each month can now be added by creating new rows rather than new columns. Furthermore, moving the month names from the schema level to the data level makes it easier to formulate questions like “For which months does the table have data?” or “How many times was a person on the road for more than 10 nights in a month?”

For these reasons, tall tables are considered better relational database design.

Querying Relational Data

Organizing and storing data in relations is only part of what an RDBMS is expected to do. The system must also be able to retrieve data when it’s needed. Since it would be very inefficient in most cases for the system to retrieve entire tables, the system must also provide a query language that users may use to specify which results they want.

There are a variety of relational query languages, and doing justice to any of them would be beyond the scope of this article. So this section discusses the core operations supported by all RDBMSs.4

Filters

One of the reasons it makes sense to store relations in tall rather than wide form is that you can query a relation to obtain all of the rows which meet a given logical condition. For example, suppose you want to see Beatrice’s hotel night counts for months other than January. In Rel you can do that like this:

def hotel_nights = {
("Alice", "Jan", 3);
("Alice", "Feb", 6);
("Alice", "Mar", 4);
("Alice", "Apr", 9);
("Beatrice", "Jan", 5);
("Beatrice", "Feb", 2);
("Beatrice", "Mar", 10);
("Beatrice", "Apr", 1);
("Carlos", "Jan", 4);
("Carlos", "Feb", 4);
("Carlos", "Mar", 5);
("Carlos", "Apr", 3);
}

def output = {
name, month, nights :
hotel_nights(name, month, nights) and
month != "Jan" and
name = "Beatrice"
}

Relation: output

"Beatrice""Apr"1
"Beatrice""Feb"2
"Beatrice""Mar"10

Getting a subset of the rows of a relation based on a logical condition applied to each tuple is called filtering. The ability to perform this operation in a general and efficient way is a powerful feature of RDBMSs because achieving similar functionality in non-relational systems often requires quite a bit of planning and work.

Joins

As illustrated in the examples in the Normal Forms section, splitting a table into two tables with fewer columns can be useful for data modeling. However, if you store related data in separate tables, you have to be able to recombine the data when necessary. This operation is called a join.

As discussed in the previous section, this table is not in Third Normal Form:

namechainprogram
CourtyardMarriottBonvoy
Residence InnMarriottBonvoy
DoubleTreeHiltonHonors
Holiday InnIHGIHG

However, its contents may be put into 3NF by splitting them into these two tables:

namechain
CourtyardMarriott
Residence InnMarriott
DoubleTreeHilton
Holiday InnIHG
chainprogram
MarriottBonvoy
HiltonHonors
IHGIHG

To join these two tables and recover the original table, you can use the second table to look up the loyalty program for each chain. One helpful way to visualize this process is to position the rows of the two tables along the sides of a rectangle5:

Join Visualization

Each square in this figure represents a combination of a row from the first table and a row from the second table, and orange dots have been placed in the squares for which the chain values match for the two rows. Each of these dots gives rise to one row in the joined table.

The interactive figure below provides another visualization for the process of joining these two tables. Press the button to form every possible combination of a row from the first table and a row from the second table (this is called the product of the two tables), and then press the filter button to remove rows that don’t represent a match:

In general, joining two tables entails forming the product of the two tables, then potentially removing some of the resulting rows based on a condition applied to each row.6

One reason it’s useful to use a rectangle to visualize the join process, as opposed to thinking of the join as a lookup operation involving the second table, is that a row in one table may match no rows or multiple rows in the other table. The following figure shows a join of two hotel-chain tables based on whether the hotels belong to the same chain:

Join Visualization

Each of the tables being joined has four rows, but the join has more than four rows because some rows match more than once. One row, namely the last row in the table on the left, doesn’t match any rows in the other table.7

Joins are often combined with a removal of some of the resulting columns. You would want to remove one of the two chain columns in the animated figure above since the two columns are identical. Other columns may also be removed if they are not needed.

Other operations

The other fundamental operations on relations are as follows:

  1. Union. Given two relations with the same fields, compute the relation consisting of all tuples which are in either of the two relations.
  2. Difference. Given two relations with the same fields, compute the relation consisting of all tuples which are in the first relation but not the second.
  3. Renaming. Change the names of relations and column header.
  4. Projection. Remove columns.

There are other operations often available in relational query systems which do not make the standard list of fundamental ones. For example, you can sort the tuples in a relation according to the values in a particular field. The result of such an operation is not really a relation since the order of its tuples must be considered meaningful. Nevertheless, sorting capabilities are typically provided by RDBMSs as a convenience.

Examples

This section illustrates how to use relations to model several common mathematical objects or real-world situations.

Irregular Tables

Many data sets encountered in the wild are organized into one large table, even when there are multiple kinds of facts to communicate. For example, here is the table published by the Massachusetts Bay Transit Authority for the outbound Providence Line train schedule:

StationTrain 801Train 861Train 803Train 865Train 805Train 867Train 807Train 869Train 809Train 871Train 811Train 813Train 873Train 815Train 875Train 817Train 877Train 819Train 879Train 821Train 823Train 881Train 825Train 827Train 883Train 829Train 885Train 831Train 887Train 833Train 889Train 835Train 891Train 837Train 893Train 839
bikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes not allowedbikes not allowedbikes not allowedbikes not allowedbikes not allowedbikes not allowedbikes not allowedbikes not allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowedbikes allowed
South Stationparkingaccessible4:25 AM5:25 AM6:25 AM7:00 AM7:25 AM8:00 AM8:25 AM8:57 AM9:25 AM10:00 AM10:25 AM11:25 AM12:05 PM12:25 PM1:05 PM1:20 PM2:05 PM2:25 PM2:55 PM3:25 PM3:52 PM4:00 PM4:25 PM4:52 PM5:00 PM5:40 PM6:00 PM6:22 PM7:00 PM7:25 PM8:00 PM8:25 PM9:00 PM9:40 PM10:20 PM11:00 PM
Back Bayno parkingaccessible4:30 AM5:30 AM6:30 AM7:05 AM7:30 AM8:05 AM8:30 AM9:02 AM9:30 AM10:05 AM10:30 AM11:30 AM12:10 PM12:30 PM1:10 PM1:25 PM2:10 PM2:30 PM3:00 PM3:30 PM3:57 PM4:05 PM4:30 PM4:57 PM5:05 PM5:45 PM6:05 PM6:27 PM7:05 PM7:30 PM8:05 PM8:30 PM9:05 PM9:45 PM10:25 PM11:05 PM
Rugglesno parkingaccessible4:33 AM5:33 AM6:33 AM7:08 AM7:33 AM8:08 AM8:33 AM9:05 AM9:33 AM10:08 AM10:33 AM11:33 AM12:13 PM12:33 PM1:13 PM1:28 PM2:13 PM2:33 PM3:03 PM3:33 PM4:01 PM4:08 PM4:33 PM5:01 PM5:08 PM5:49 PM6:08 PM6:30 PM7:08 PM7:33 PM8:08 PM8:33 PM9:08 PM9:48 PM10:28 PM11:08 PM
Forest Hillsparkingaccessible11:13 PM
Hyde Parkparkingaccessible9:41 AM10:16 AM11:41 AM12:21 PM1:21 PM2:21 PM8:16 PM9:16 PM10:36 PM11:18 PM
Route 128parkingaccessible4:44 AM5:44 AM6:44 AM7:18 AM7:44 AM8:18 AM8:44 AM9:15 AM9:51 AM10:21 AM10:44 AM11:47 AM12:29 PM12:44 PM1:26 PM1:39 PM2:26 PM2:44 PM3:14 PM3:44 PM4:19 PM4:44 PM5:19 PM6:00 PM6:19 PM6:41 PM7:19 PM7:44 PM8:21 PM8:44 PM9:21 PM9:59 PM10:41 PM11:25 PM
Canton Junctionparkingaccessible4:50 AM5:50 AM6:50 AM7:24 AM7:50 AM8:24 AM8:50 AM9:21 AM9:57 AM10:27 AM10:50 AM11:53 AM12:35 PM12:50 PM1:32 PM1:45 PM2:32 PM2:50 PM3:20 PM3:50 PM4:25 PM4:50 PM5:25 PM6:06 PM6:25 PM6:47 PM7:25 PM7:50 PM8:27 PM8:50 PM9:27 PM10:05 PM10:47 PM11:31 PM
Canton Centerparkingaccessible5:53 AM7:27 AM8:28 AM9:28 AM10:30 AM12:38 PM1:35 PM2:35 PM3:24 PM4:28 PM5:29 PM6:29 PM7:29 PM8:30 PM9:30 PM10:50 PM
Stoughtonparkingaccessible6:01 AM7:36 AM8:37 AM9:37 AM10:38 AM12:46 PM1:43 PM2:43 PM3:34 PM4:39 PM5:40 PM6:39 PM7:37 PM8:38 PM9:38 PM10:58 PM
Sharonparkingaccessible4:56 AM6:56 AM7:56 AM8:56 AM10:03 AM10:56 AM11:59 AM12:56 PM1:51 PM2:56 PM3:56 PM4:17 PM4:56 PM5:17 PM6:12 PM6:53 PM7:56 PM8:56 PM10:11 PM11:37 PM
Mansfieldparkingaccessible5:04 AM7:04 AM8:04 AM9:04 AM10:11 AM11:04 AM12:07 PM1:04 PM1:59 PM3:04 PM4:04 PM4:25 PM5:04 PM5:25 PM6:20 PM7:01 PM8:04 PM9:04 PM10:19 PM11:45 PM
Attleboroparkingaccessible5:12 AM7:12 AM8:12 AM9:12 AM10:19 AM11:12 AM12:15 PM1:12 PM2:07 PM3:12 PM4:12 PM4:34 PM5:12 PM5:34 PM6:29 PM7:09 PM8:12 PM9:12 PM10:27 PM11:53 PM
Providenceparkingaccessible5:45 AM7:45 AM8:33 AM9:33 AM10:45 AM11:33 AM12:36 PM1:40 PM2:26 PM3:38 PM4:33 PM4:59 PM5:34 PM5:55 PM6:57 PM7:34 PM8:40 PM9:33 PM10:48 PM12:14 AM
TF Green Airportparkingaccessible6:00 AM8:00 AM11:00 AM1:55 PM3:53 PM5:13 PM6:09 PM7:12 PM8:55 PM11:03 PM
Wickford Junctionparkingaccessible6:18 AM8:15 AM11:18 AM2:10 PM4:10 PM5:31 PM6:31 PM7:28 PM9:10 PM11:20 PM

The first row specifies whether bikes are allowed on each train, while the subsequent rows indicate train departure times. Therefore, the rows do not all fit the same fact template. This implies that the table does not correspond to a relation.

Furthermore, even if the bikes row were removed, the fact template for the remaining rows would be very unwieldy:

[station] [ has | does not have] parking, [is | is not] accessible, has a Train 801 departure at [time], has a Train 861 departure at [time], …

All of those train numbers should really be data; they should not be part of the schema. In other words, the problem with this table is that it’s in wide form rather than tall form.

The best way to create a relational data model with the same information as this table is to disregard the original table shape altogether and instead think about the kinds of facts which are represented. There are four basic types:

Train [number] allows bikes.
[station] has parking.
[station] is accessible.
Train [number] arrives at [station] at [time].

Therefore, the information in the table may be modeled using four relations:

  1. allows_bikes:
train
861
803
865
805
867
839
  1. parking:
station
South Station
Forest Hills
Hyde Park
Route 128
Canton Junction
Canton Center
Stoughton
Sharon
Mansfield
Attleboro
Providence
TF Green Airport
Wickford Junction
  1. accessible
station
South Station
Back Bay
Ruggles
Forest Hills
Hyde Park
Route 128
Canton Junction
Canton Center
Stoughton
Sharon
Mansfield
Attleboro
Providence
TF Green Airport
Wickford Junction
  1. departure_time
stationtraintime
South Station8014:25 AM
South Station8615:25 AM
South Station8036:25 AM
South Station8657:00 AM
South Station8057:25 AM
Wickford Junction8276:31 PM
Wickford Junction8297:28 PM
Wickford Junction8339:10 PM
Wickford Junction83711:20 PM

If you want to print out the train schedule and display it in your home, the original table is preferable to these four tables. However, the four-relation model is much better for purposes of storing and querying the data in a database management system.

Nested Data

Applications often store data in nested data structures. For example, when you play a video game, some of the information about the state of your game might be stored in a form that looks like this:

{
  name: "Aniko",
  color: "green",
  power_ids: [117, 844, 512],
  levels: {
    1: { 
      points: 10, 
      tries: [0.4, 0.5, 0.4, 0.65, 0.8, 1.0] 
    },
    2: { 
      points: 8, 
      tries: [0.1, 0.1, 0.25, 0.85, 1.0] 
    },
    3: { 
      points: 10,
      tries: [0.3, 0.2, 0.45, 0.85, 0.9, 1.0]
    },
    4: {
      points: 9, 
      tries: [0.2, 0.1, 0.15, 0.6, 0.1, 0.2, 0.8, 0.9, 0.95, 1.0] 
    }
  }
}

This object is meant to be interpreted as follows:

  • Your character’s name is Aniko.
  • Your avatar is green.
  • Your powers are the ones whose ids are 117, 844, and 512.
  • On Level 1, you got 10 points after trying six times to complete the level. The first time you completed 40% of the level before failing and having to start over, then 50%, then 40% again, and so on.

The term nested refers to the fact that any of the values may store further compound values inside it. For example, the list of tries values is inside the value for a particular level number, which in turn is inside the levels value. Such nesting behavior is not allowed for a relation in First Normal Form.

One of the most commonly used formats for passing data around on the internet is a nested data format called JSON. Some database management systems, like MongoDB, are built to support nested data models rather than relational data models. For these reasons, it is very common to encounter nested data sources in practice.

Despite the structural differences between nested data and relations, you can build a relational model for this data by thinking about the types of facts which are represented. One way to organize the facts in the object above into a collection of sentence templates is as follows:

The character’s [attribute] is [value].
The character has power [power].
The character earned [score] points on level [level].
On level [level] and try number [try] on that level, you completed [proportion] of the level.

The tables for these templates look like this:

attributevalue
nameAniko
colorgreen
power
117
844
512
levelpoints
110
28
310
49
leveltryproportion
110.4
120.5
130.4
140.65
150.8
161.0
210.1
490.95
4101.0

Graphs

Relationships between real-world entities may be represented graphically, with points representing the entities and labeled lines representing the relationships:

This figure represents a situation in which Alice and Beatrice are friends, Beatrice and Carlos are friends, Alice and Carlos aren’t friends, Alice is French and was born on 1996-04-17, and so on.

The points are called nodes (or vertices) and pairs of nodes — indicated by the lines connecting them — are called relationships (or edges). A set of vertices connected in pairs by edges is called a graph.

The graph representation is well-suited to problems for which it is important to understand how things are related to each other. For example, it’s visually apparent from the graph that Carlos is a second-degree friend of Alice – that is, a friend of a friend. Some organizations use Graph Database Management Systems like Neo4j to build their entire database on a graph-based data model.

It is straightforward to represent a graph database as a relational database because a graph is a collection of relations. Specifically, each edge represents a pair of nodes, and a pair is a two-element tuple. Therefore, each edge type gives rise to a relation containing two-element tuples:

def friends = {
("Alice", "Beatrice");
("Beatrice", "Alice");
("Beatrice", "Carlos");
("Carlos", "Beatrice");
}

def birthday = {
("Alice", 1996-04-17);
("Beatrice", 1975-08-11);
("Carlos", 1981-11-06);
}

def nationality = {
("Alice", "France");
("Beatrice", "United States");
("Carlos", "Costa Rica");
}

Time Series

A times series is a sequence of values which are each associated with a date or time. For example, you might keep track of the performance of a stock over time by checking its price at 4 p.m. every trading day and recording a data point with that date and price.

Time series may be represented as relations with two-element tuples of the form (timestamp, value):

def stock_price = {
(2022-01-03, 617.05);
(2022-01-04, 619.37);
(2022-01-05, 618.11);
(2022-01-06, 618.88);
}

Arrays

One-dimensional arrays

A one-dimensional array is an ordered sequence of values:

$$A = [4, 18, 3, 11, 7]$$

Since the tuples in a relation have no particular order, it does not work to represent an array as a relation of one-element tuples:

def A = {4; 18; 3; 11; 7}

def output = A

Relation: output

3
4
7
11
18

Instead, you can create a relation of two-element tuples with both the index and element:

def A = {
(1, 4);
(2, 18);
(3, 3);
(4, 11);
(5, 7);
}

def output = A

Relation: output

14
218
33
411
57

Note: While tuples and arrays are both ordered sequences of values, you should not use a tuple to model an array in Rel, because the language is not designed to support that approach. Instead, use a collection of (index, value) pairs as described above.

Multi-dimensional arrays

A similar approach works for multi-dimensional arrays. For example, consider this two-dimensional array:

$$\left[\begin{array}{ccc} 24 & 14 & 11 \\ 7 & 18 & 12 \\ 5 & 4 & 9 \end{array}\right]$$

The idea is to use a relation of three-element tuples of the form (row index, column index, value):

def A = {
(1, 1, 24);
(1, 2, 14);
(1, 3, 11);
(2, 1, 7);
(2, 2, 8);
(2, 3, 12);
(3, 1, 5);
(3, 2, 4);
(3, 3, 9);
}

def output = A

Relation: output

1124
1214
1311
217
228
2312
315
324
339

This is sometimes called the “ijv” or “triplet” format for representing a two-dimensional array. You can access specific values using relational application:

def A = {
(1, 1, 24);
(1, 2, 14);
(1, 3, 11);
(2, 1, 7);
(2, 2, 8);
(2, 3, 12);
(3, 1, 5);
(3, 2, 4);
(3, 3, 9);
}

def output = A[1, 2]

Relation: output

14

Hierarchies

The National Basketball Association has 30 teams which are subdivided into two conferences of 15 teams each. Each conference is further subdivided into three divisions of 5 teams each. This is an example of a hierarchy, and it may be visualized using an outline:

  • Eastern Conference
    • Atlantic Division
      • Boston Celtics
      • New Jersey Nets
      • New York Knicks
      • Philadelphia 76ers
      • Toronto Raptors
    • Central Division
      • Chicago Bulls
      • Cleveland Cavaliers
      • Detroit Pistons
      • Indiana Pacers
      • Milwaukee Bucks
    • Southeast Division
      • Atlanta Hawks
      • Charlotte Bobcats
      • Miami Heat
      • Orlando Magic
      • Washington Wizards
  • Western Conference
    • Northwest Division
      • Denver Nuggets
      • Minnesota Timberwolves
      • Oklahoma City Thunder
      • Portland Trail Blazers
      • Utah Jazz
    • Pacific Division
      • Golden State Warriors
      • Los Angeles Clippers
      • Los Angeles Lakers
      • Phoenix Suns
      • Sacramento Kings
    • Southwest Division
      • Dallas Mavericks
      • Houston Rockets
      • Memphis Grizzlies
      • New Orleans Pelicans
      • San Antonio Spurs

Despite the structural differences between an outline and a table, hierarchical relationships may be modeled very reasonably as relations. To build a relational model, consider the kinds of facts which are implicit in the outline and create a relation for each fact template.

In this case, there are relationships between divisions and conferences:

The Atlantic Division is in the Eastern Conference
The Central Division is in the Eastern Conference
The Southeast Division is in the Eastern Conference
The Northwest Division is in the Western Conference
The Pacific Division is in the Western Conference
The Southwest Division is in the Western Conference

The other kind of relationship is the one between teams and divisions:

The Boston Celtics are in the Atlantic Division
The New Jersey Nets are in the Atlantic Division

Therefore, the NBA’s conference and division hierarchy may be modeled using two relations:

divisionconference
AtlanticEastern
CentralEastern
SoutheastEastern
NorthwestWestern
PacificWestern
SouthwestWestern
teamdivision
Boston CelticsAtlantic
Brooklyn NetsAtlantic
New York KnicksAtlantic
Philadelphia 76ersAtlantic
Toronto RaptorsAtlantic
Chicago BullsCentral
New Orleans PelicansSouthwest
San Antonio SpursSouthwest

Conclusion

In summary, relational data modeling is a flexible paradigm for representing facts in an organized way. The main idea is to create a collection of sentence templates which generalize the facts being represented. Each template gives rise to a relation whose tuples represent the ways that the template’s fields may be filled in to obtain a fact. A relational database is a collection of such relations.

The Normal Forms convey principles about how to split relations in such a way that redundancies are eliminated. Query operations allow us to retrieve data from the database, including filtering to obtain a subset of a relation’s tuples based on a logical condition and joins for combining data from multiple relations.


  1. Not every conceptual model is a data model. For example, friendship between individuals may be posited as a conceptual model for alliance between countries. The concept of friendship is a model in this example, but it is not data. ↩︎

  2. It should be understood in this definition that a key should not include more fields than necessary. For example, a row in the review table is uniquely determined by all three fields, but the rating field should not be included in the key since a row is also uniquely determined by just the hotel and user fields. ↩︎

  3. Actually, Graph Normal Form has a second requirement, which is that values in the database should be equal if and only if they are meant to correspond to the same thing in the real world. For more details, see the upcoming Graph Normal Form concept guide. ↩︎

  4. The designation of these specific operations as the fundamental ones goes back to Ted Codd’s paper A Relational Model of Data for Large Shared Data Banks, which was published in Communications of the ACM in June of 1970. ↩︎

  5. Visualization credit: R for Data Science↩︎

  6. The steps actually performed by database systems are often quite different from these steps because it’s often possible to achieve mathematically identical results with less work. ↩︎

  7. In some cases, it may be necessary to keep track of rows in one table that do not match any rows in the other table. Customarily this is done by adding a single row to the joined table for the unmatched row, with missing values for all of the fields from the other table.

    In a left join, any unmatched rows from the first table are retained in this way. In a right join, any unmatched rows from the second table are retained. In a full join, any unmatched rows from either table are retained.

    These variations of the join operation are not necessary in Rel because unmatched rows may be handled using separate logic. ↩︎