Integrity Constraints

This concept guide covers integrity constraints in Rel.

Download this guide as a RAI notebook by clicking here.

Goal

Integrity Constraints (ICs) are used to ensure accuracy and consistency of the data in a relational database. They can guarantee that data insertion, updating, and other processes are performed in such a way that data integrity is not affected. The goal of this concept guide is to familiarize the reader with Rel’s IC functionality through a number of examples. After reading this concept guide you should expect to know how to write Integrity Constraints in Rel.

Introduction

An integrity constraint (IC) ensures, as the name suggests, the integrity of the database, requiring that its relations and data obey the constraint specified. We can broadly classify ICs into three categories:

  • Type Constraints are used to check data types (e.g.: String(name)). These constraints are inherent to the data model (aka schema) and usually can be evaluated statically. They are discussed in detail in Section Type Constraints.
  • Value Constraints are used to check that individual entries in a relation hold a specific value or lie within a specific range (e.g.: age >= 0). They are discussed in detail in Section Value Constraints.
  • Logical Constraints are used to ensure that relations adhere to specific logical rules. They often involve multiple relations. Logical constraints are also a great way to express expert knowledge in our RKGMS. They are discussed in detail in Section Logical Constraints.

In Rel, an integrity constraint is defined in a similar way as any relation. Consequently, the same language constructs (i.e.: Relational Abstractions, Relational Application, and more) can be used. The technical details of how integrity constraints are defined can be found in the Integrity Constraints section in the Language reference.

In contrast to defining a relation, where variables are kept only if they adhere to the logical statement, an IC probes the opposite, i.e., variable combinations are kept only when they don’t fulfill the logical statement.

ICs may involve a single fact, a single relation, or multiple relations. In Section Let’s Get Started we show the basics of how to write integrity constraints.

More advanced ICs are discussed in Section Use Cases. More specifically, we discuss type, value and logical constraints. In Section Mini-IMDb Dataset we show how ICs can be enforced on real-world data.

When an IC is violated, the running transaction is aborted and the appropriate information is logged in the output relation. It is even possible to write ICs that report back counter-examples that violate the IC.

As we will see, the best practices when writing ICs are:

  • to give descriptive names to ICs and
  • to use arguments in the IC declaration.

Following both points will be most helpful when analyzing IC violations. Naming the IC will help us identify which IC is violated; and IC arguments will help us understand the reason for an IC violation, because counter-examples will be reported back for the specified arguments. In Section Best Practices we discuss in detail the advantages of these two points.

Let’s Get Started

Simple Example

A very simple IC is:

query
ic {
1 < 2
}

which checks the correctness of the formula 1 < 2. This IC, of course, holds independently of any particular data in the database.

Generally, an IC involves one or more relations because its purpose is to ensure the integrity of the data, which is stored inside relations. To check, for instance, that two relations are the same we can write:

query
def R = {1;2}, {'a'; 'b'}
def S = {(1, 'a'); (1, 'b'); (2, 'a'); (2, 'b')}

ic {
equal(R, S)
}

Note, that equal(R, S) is not the same as R = S when R or S is a relation and not a scalar value. For more details see the definition of equal.

Named Integrity Constraints

We can give a name/identifier to an IC, as shown below. This is optional: if no name is provided by the user, the system assigns a unique identifier to the IC.

query
ic equality_check {
equal(1, 1)
}

Giving descriptive names to ICs will help identify them if they are violated (see Section Integrity Constraint Violations.

Integrity Constraint Violations

In this section, we will discuss IC violations and what information about the violation is reported back to the user. In Section Best Practices, we discuss how users can ensure they get the most actionable information back when an IC is violated.

Let’s see what happens when we define an IC with a statement that evaluates to false:

ic violation_check {
1 = 2
}

As you can see below, the transaction is aborted due to the IC violation. In general, any transaction that triggers an IC violation will be aborted. This may happen because (1) we try to define an IC that does not hold given the current data, or (2) we try to change the data in a way that violates an existing IC.

ERROR: LoadError: ClientException: Transaction aborted. The following problems caused the abort:
-- INTEGRITY_CONSTRAINT_VIOLATION ---------------------------------------------------
Integrity constraints are violated. The ICs are defined as follows:

2| ic violation_check = {
   ^^^^^^^^^^^^^^^^^^^^^^^
3|     1 = 2
   ^^^^^^^^^^^^^^^^
4| }
   ^^


error: ICs violated
Transaction output:
    abort() => [()]
    output(:violation_check) => [()]

The error message highlights which IC is violated. Violated ICs are also reported in the output relation. The line output(:violation_check) => [()] states that the relation violation_check, which we call the IC relation, holds the tuple () (which represents true). Because it is not empty, the IC is violated.

Concretely, in the above example, the IC relation violation_check is equivalent to the expression: not 1 = 2, which is, of course, true. Hence violation_check holds a logical true, represented by the tuple ().

In Section The Advantage of Integrity Constraints with Arguments we will see that if an IC is defined with arguments then the IC relation contains all counter-examples that violate the IC.

Best Practices

In this section, we discuss the best practices for how to write ICs such that we have the most actionable information at hand when an IC is violated.

The Advantage of Integrity Constraints with Arguments

In the previous examples, each IC had no arguments and the IC violation could only tell us that the IC was violated. To gain more information about why and for which values the IC was violated, we need to add arguments to the IC.

For example, when we check that relation person[:age]

def person[:age] = {
(1, 37);
(2, 31);
(3, 47);
(4, 27);
(5, 12);
}

contains only people that are older than 18

ic value_violation(id, age) {
person[:age](id, age)
implies
age >= 18
}

we get the following error report:

ERROR: LoadError: ClientException: Transaction aborted. The following problems caused the abort:
-- INTEGRITY_CONSTRAINT_VIOLATION ---------------------------------------------------
Integrity constraints are violated. The ICs are defined as follows:

10| ic value_violation(id, age) {
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
11|     person[:age](id, age)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
12|     implies
    ^^^^^^^^^^^^
13|     age >= 18
    ^^^^^^^^^^^^^^
14| }
    ^^


error: ICs violated
Transaction output:
    value_violation(Int64, Int64) => [(5, 12)]

The report points out that the person with id = 5 has age 12, and this fact is the cause of the IC violation, since 12 < 18.

If we write the same IC without arguments, expecting it to hold for all id and age values,

ic value_violation{
forall(id, age:
person[:age](id, age)
implies
age >= 18
)
}

the IC violation is reported as follows:

ERROR: LoadError: ClientException: Transaction aborted. The following problems caused the abort:
-- INTEGRITY_CONSTRAINT_VIOLATION ---------------------------------------------------
Integrity constraints are violated. The ICs are defined as follows:

10| ic value_violation() {
    ^^^^^^^^^^^^^^^^^^^^^^^
11|     forall(id, age:
    ^^^^^^^^^^^^^^^^^^^^
12|         person[:age](id, age)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
13|         implies
    ^^^^^^^^^^^^^^^^
14|         age >= 18)
    ^^^^^^^^^^^^^^^^^^^
15|     }
    ^^^^^^


error: ICs violated
Transaction output:
    value_violation() => [()]

There is no additional information about which data record caused the IC violation.

In particular, the only information reported back is that the IC evaluated to true (the relation {()}), which means the expression in the IC evaluated to false.

Unbound argument errors when evaluating ICs

When checking an IC, the system evaluates its negation. Any instance that satisfies the negation will be a counter-example. For the evaluation to take place, the variables must be bound to a finite domain.

Otherwise, it is possible that an infinite number of values can violate the IC. This happens when arguments are not bound, which triggers an unbound variable error.

Consider this example:

def R = {1; 2}

ic check_R_values(id) {
R(id) and id <= 2
}

It results in an error that tells us that (1) the IC is violated and (2) the variable id is unbound:

INTEGRITY_CONSTRAINT_VIOLATION
error: ICs violated
-- UNBOUND_VARIABLE -----------------------------------------------------------------
3| ic check_R_values(id) {
   ^^^^^^^^^^^^^^^^^^^^^^^^^^
4|     R(id) and id <= 2
   ^^^^^^^^^^^^^^^^^^^^^^
5| }
   ^^

error: expression is not bound. Maybe this is a typo, or consider adding '@inline' if the definition is intentionally infinite?
-- UNBOUND_VARIABLE -----------------------------------------------------------------
3| ic check_R_values(id) {
                     ^^^

error: 'id' is not bound. Maybe this is a typo, or consider adding '@inline' if the definition is intentionally infinite?
Transaction output:
    abort() => [()]

At a first look, these error reports might be surprising, because all values in R fulfill the condition id <= 2. However, because an IC relation is the negation of the specified expression, the IC relation check_R_values(id) is equivalent to

id: not (R(id) and id <= 2)

and there are an infinite number of id values that make the statement not (R(id) and id <= 2) true. In particular, nothing here says that the values of id must be in R. For instance, values like id = 3 or id = "abc" are valid counter-examples that violate the IC.

We can rewrite the above IC using implies to bind the variable id to relation R and get the desired integrity check, which is valid:

query
def R = {1; 2}

ic check_R_values(id) {
R(id) implies id <= 2
}

The Advantage of Named Integrity Constraints

Similar considerations apply to naming ICs. The system creates a unique identifier for ICs that lack a user-provided name. If you have multiple ICs without a name, it will be hard to know which IC was violated.

Let’s take the following toy example:

ic (id){
id = {1;2;3}
implies
id <= 2
}

ic (id){
id = {1;2;3}
implies
id < 2
}

Now both ICs are violated, and the system reports the id values that make them fail:

ERROR: LoadError: ClientException: Transaction aborted. The following problems caused the abort:
-- INTEGRITY_CONSTRAINT_VIOLATION ---------------------------------------------------
Integrity constraints are violated. The ICs are defined as follows:

1| ic (id){
   ^^^^^^^^^
2|     id = {1;2;3}
   ^^^^^^^^^^^^^^^^^
3|     implies
   ^^^^^^^^^^^^
4|     id <= 2
   ^^^^^^^^^^^^
5| }
   ^^

 7| ic (id){
    ^^^^^^^^^
 8|     id = {1;2;3}
    ^^^^^^^^^^^^^^^^^
 9|     implies
    ^^^^^^^^^^^^
10|     id < 2
    ^^^^^^^^^^^
11| }
    ^^


error: ICs violated
Transaction output:
    abort() => [()]
    output(:rel-query-action##3385#constraint#0, Int64) => [(3,)]
    output(:rel-query-action##3385#constraint#1, Int64) => [(2,), (3,)]

However, we didn’t name the ICs and therefore we can’t infer which counter-examples correspond to which IC.

Let’s summarize the insight we gained from the the last sections. The best practices are (1) give ICs descriptive names and (2) define ICs with useful arguments such that the error reports, when IC violations occur, provide the most actionable information for analyzing why the IC violation was triggered.

Use Cases

In the following, we demonstrate the various use cases for integrity constraints. To do so, we will define a person relation which contains people and their attributes. As we progress, we will define more and more person attributes as needed to showcase different IC use cases.

Type Constraints

We start with defining the name attribute — and in particular, person[:name, :first] and person[:name, :last] to store their first and last names.

install
def person[:name, :first] = {
(1, "Fred");
(2, "David");
(3, "Brian");
(4, "Jacques");
(5, "Julie");
}

def person[:name, :last] = {
(1, "Smith");
(2, "Johnson");
(3, "Williams");
(4, "Brown");
(5, "Jones");
}

Let’s query it to see how our data looks.

query
person

Relation: output

:name:first1"Fred"
:name:first2"David"
:name:first3"Brian"
:name:first4"Jacques"
:name:first5"Julie"
:name:last1"Smith"
:name:last2"Johnson"
:name:last3"Williams"
:name:last4"Brown"
:name:last5"Jones"

The name attribute is modeled in the highly normalized Graph Normal Form (GNF). GNF encourages dividing the attribute :name into the more basic sub-attributes :first and :last. We now want to check that the first and last names are strings:

query
ic person_type_check(id, sub_attribute){
forall(x:
person[:name, sub_attribute](id, x) implies String(x)
)
}

The IC person_type_check checks that all sub-attributes under person[:name] have value type String. Let’s say that we want to add another sub-attribute to person[:name] called :MI, with the person’s middle initial. We would also have to store it as String to not violate the IC. For instance, we could not store it as Char without violating or modifying the IC.

Type ICs can often be evaluated statically, based only on the schema, without having to actually evaluate the relation. With the @static attribute, we can tell the RKGMS that we expect that this IC can be statically evaluated.

query
@static
ic person_type_check(id, type) {
forall(x :
person[:name, type](id, x) implies String(x)
)
}

If an IC named ic_name is annotated as @static but cannot be statically evaluated, the system will issue a warning:

warn: declaration 'ic_name' is marked @static but not computed statically

Value Constraints

The next type of constraints we want to check is value constraint. For that, we add an age attribute, :age, to our person relation.

install
def person[:age] = {
(1, 37);
(2, 31);
(3, 47);
(4, 27);
(5, 22);
}
def person_age = person[:age]

and check that all individuals are 18 or older:

query
ic is_adult(id, age) {
person[:age](id, age)
implies
age >= 18
}

Value Constraints can also be used to check the arity of relations. We can check that person[:age] has arity 2.

query
ic arity_check(x){
arity[person_age] = x implies x = 2
}

Note that by using the IC argument x, we will be informed of the actual value of any arity other than 2 if the IC is violated.

The cardinality of a relation can also be checked using value constraints. We can check that person[:age] has 5 records.

query
ic cardinality_check(x){
count[person_age] = x implies x = 5
}

Logical Constraints

Disjoint

Next we will define a :gender attribute in person and check the disjoint property of the derived male (‘M’), female (‘F’), and non_binary (‘X’) relations.

install
def person[:gender] = {
(1, 'M');
(2, 'M');
(3, 'F');
(4, 'F');
(5, 'X');
}

def male(p) = person(:gender, p, 'M')
def female(p) = person(:gender, p, 'F')
def non_binary(p) = person(:gender, p, 'X')
def gender(p) = person(:gender, p, _)

The integrity constraint reads,

query
ic is_disjoint() {
disjoint(female, male) and
disjoint(non_binary, male female)
}

where we check that male and female are disjoint relations and non_binary is disjoint from the union (indicated by of the male and female relations.

Subset

We will use the same gender attribute and check whether the derived gender relations are subsets of gender.

query
ic is_subset() {
subset(female, gender) and
male gender and
non_binary gender
}

In Rel, the built-in \subset{subset} relation can be called either by its name or by the symbol .

Unique Value

Using the :age attribute in person we will define a unique value constraint to check that each person has only one age value.

install
def person[:age] = {
(1, 37);
(2, 31);
(3, 47);
(4, 27);
(5, 22);
}
query
ic is_unique(id) {
forall(age1, age2:
(person[:age](id, age1) and
person[:age](id, age2))
implies
age1 = age2
)
}

In the IC above, the id is an IC argument and in case of a violation, the error report returns all id values that violate the IC.

The uniqueness of a person’s age can also be checked using the built-in function relation:

query
ic is_unique {
function(person[:age])
}

Another alternative to check the uniqueness is to count all ages associated with a person:

query
ic is_unique(id, x) {
count[person[:age, id]] = x implies x = 1
}

Symmetry

Now we want to add a logical constraint to check the symmetry of a relation. To demonstrate this, we define an is_friends relation that collects pairs of people that are friends with each other. Obviously, friendship is mutual and consequently the relation is_friends should reflect that. We do so by checking that the relation is symmetric w.r.t. both arguments:

install
def is_friends = {
(1, 2);
(1, 5);
(2, 1);
(3, 4);
(4, 3);
(5, 1);
}

ic is_symmetric(id1, id2) {
is_friends(id1, id2) implies is_friends(id2, id1)
}

The symmetry of a relation can also be checked in a point-free way (which lets us skip the id1 and id2 variables) with the built-in transpose relation defined in the Standard Library.

query
ic is_symmetric {
equal(is_friends, transpose[is_friends])
}

While this formulation is more compact, it has no arguments, so it will not show us the counter-example values for id1, id2 if the IC is violated.

Transitivity

Another logical constraint that might be of interest is transitivity. To demonstrate this, we are defining an attribute :dob which refers to the date of birth of a person, and an is_older relation contains pairs (a, b) where person a is older than person b.

install
def person[:dob] = {
(1, 1984);
(2, 1990);
(3, 1974);
(4, 1994);
(5, 1999);
}

def is_older(id1, id2) { person[:dob][id1] < person[:dob][id2]}

We know that the relation is_older should be transitive: if person a is older than person b which is older than person c, then a is also older than c. So let’s check that this is true by writing the following IC:

query
ic is_transitive {
is_older.is_older is_older
}

A relation that doesn’t have to be transitive is the friendship relation. If a is friends with b and b is friends with c then we can’t conclude that a is (or isn’t) friends with c. We can check whether or not our is_friends relation is transitive.

ic is_transitive(id1, id2, id3) {
is_friends(id1, id2) and is_friends(id2, id3)
implies
is_friends(id1, id3)
}

We use the point-wise notation, where we state explicitly the relation variables id1, id2, and id3, so we can capture friend combinations where the transitivity condition doesn’t hold, and indeed we get the following IC violation:

ERROR: LoadError: ClientException: Transaction aborted. The following problems caused the abort:
-- INTEGRITY_CONSTRAINT_VIOLATION ---------------------------------------------------
Integrity constraints are violated. The ICs are defined as follows:

15| ic is_transitive(id1, id2, id3) {
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
16|     is_friends(id1, id2) and is_friends(id2, id3)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
17|     implies
    ^^^^^^^^^^^^
18|     is_friends(id1,id3)
    ^^^^^^^^^^^^^^^^^^^^^^^^
19| }
    ^^


error: ICs violated
Transaction output:
    abort() => [()]
    output(:is_transitive, Int64, Int64, Int64) => [(1, 2, 1), (1, 5, 1), (2, 1, 2), (2, 1, 5), (3, 4, 3), (4, 3, 4), (5, 1, 2), (5, 1, 5)]

From the error report we see that we have 8 tuples (a, b, c) that violate our transitivity condition. In particular, we have two classes of violations:

  • trivial violations like (1, 2, 1) where the first and last persons are the same and you can’t be friends with yourself.
  • more interesting triples like (2, 1, 5) where all three people are different.

This example nicely highlights why arguments in your IC definition can be very helpful in understanding the data.

Mini-IMDb Dataset

In this section, we use the Mini-IMDb data as an example of a real world dataset, and show how ICs can enforce a schema and encode expert knowledge in our database.

Loading Data

To load the Mini-IMDb dataset, we use the load_csv functionality. First we define a schema and a path to our data files:

install
def url = "s3://relationalai-documentation-public/imdb_job/data_mini/"

def config_name[:path] = concat[url, "name.csv"]
def config_name[:schema, :person_id] = "int"
def config_name[:schema, :name] = "string"
def name_csv = load_csv[config_name]

def config_title[:path] = concat[url, "title.csv"]
def config_title[:schema, :movie_id] = "string"
def config_title[:schema, :title] = "string"
def config_title[:schema, :year] = "int"
def title_csv = load_csv[config_title]

def config_cast_info[:path] = concat[url, "cast_info.csv"]
def config_cast_info[:schema, :movie_id] = "int"
def config_cast_info[:schema, :person_id] = "int"
def cast_info_csv = load_csv[config_cast_info]

Next we define relations to access the data in Rel:

install
def movie[:title] = title_csv[:movie_id, pos], title_csv[:title, pos] from pos
def movie[:year] = title_csv[:movie_id, pos], title_csv[:year, pos] from pos
def movie[:id] = first[movie[_]]

def actor[:name] = name_csv[:person_id, pos], name_csv[:name, pos] from pos

def cast = cast_info_csv[:movie_id, pos], cast_info_csv[:person_id, pos] from pos

def co_actors(id1, id2) =
cast(movie_id, id1) and
cast(movie_id, id2) and
id1 != id2
from movie_id

Defining Integrity Constraints

Now, let’s define ICs on this Mini-IMDb dataset. First we check that the actor name is a String and the actor identifier is a Number.

Note that there is an advantage in creating entities for concepts like actors or movies and using the entity keys to refer to them compared to using simple identifiers like integers. For details see the Entities concept guide.

query
@static
ic person_type_check(name, id) {
actor(:name, id, name)
implies
Number(id) and String(name)
}

In general, we should check all the other data types. For the purpose of this concept guide, we will consider only this one example.

As discussed above, we can also insert expert knowledge into our database using ICs. For instance, we can check that no movie year is made before 1874 because we know the oldest entry in IMDb is the documentary “Passage de Venus” made in 1874.

query
ic value_check(mid, year) {
movie(:year, mid, year)
implies
year > 1874
}

Next, we can also check that our co-author relation, co_actors, is symmetric as it should be:

query
ic is_symmetric {
equal(co_actors, transpose[co_actors])
}

and indeed it is.

The Mini-IMDb dataset is quite small and the number of useful ICs is quite small. As the dataset grows and more information is stored in the knowledge graph, the need for integrity constraints also grows, to ensure that the knowledge is stored in a correct and consistent manner.