# Rel Primer: Aggregations, Group-by, and Joins

Explaining Aggregations, Group-by, and Joins in Rel

## Introduction

This Rel Primer section focuses on common database operations in Rel: aggregations, group-by aggregations, and joins.

See Basic Syntax for an introduction to the basic syntax of Rel, a pre-requisite for this Rel primer section.

## Aggregations

To demonstrate aggregations, we will import some basic information about soccer players (from Spring 2021),
store this data in an EDB relation (`player_csv`

),
and *install* a few derived relations, making them available for future use.

First, we import our data and save it in an EDB relation:

`def config:data = """`

name,salary,age,plays_for,nationality

Messi,70,32,BFC,Argentina

Griezmann,45,28,BFC,France

Busquets,15,31,BFC,Spain

Pique,12,32,BFC,Spain

Dembele,12,22,BFC,France

Umtiti,12,25,BFC,France

Cortois,7,28,RM,Belgium

Carvajal,7,28,RM,Spain

Ramos,15,34,RM,Spain

Varane,7,27,RM,France

Marcelo,7,32,RM,Brazil

Kroos,10,30,RM,Germany

Modric,10,35,RM,Croatia

"""

def config:schema = {

:name, "string";

:salary, "int";

:age, "int";

:plays_for, "string";

:nationality, "string"

}

def csv = load_csv[config]

def delete[:player_csv] = player_csv

def insert[:player_csv] = csv

def output = player_csv

In practice, `config:data`

would be a file uploaded to a notebook as a string relation, or `config:path`

would be specified
to point to a CSV file location. See the CSV Import How-To Guide for details.

Now we install some derived relations:

`def player(prop, name, val) =`

player_csv(prop, row, val)

and player_csv(:name, row, name) from row

def name(x) = player(_, x, _)

def salary = player:salary

def age = player:age

def plays_for = player:plays_for

def nationality = player:nationality

def team(t) = player:plays_for(_, t) // set of teams

Once installed, these definitions are available for querying (see Installing Models).
(For an explanation of `_`

, which helps collect the set of teams in the `team`

relation,
see the Projection section below.
For an explanation of `from`

, see below.
Expressions like `player:salary`

use the Rel module syntax.)

`player_csv`

is an EDB relation, created by the *update*transaction that loaded the CSV data and inserted it into

`player_csv`

, above. The relations `player`

, `name`

, `salary`

, etc. are IDB relations,
sometimes known as *views*, based on the EDB data. The

*install*transaction above makes them available to subsequent queries.

Since `player`

is now installed, we can query it:

`def output = player`

Relation: output

:age | "Busquets" | 31 |

:age | "Carvajal" | 28 |

:age | "Cortois" | 28 |

:age | "Dembele" | 22 |

:age | "Griezmann" | 28 |

:age | "Kroos" | 30 |

:age | "Marcelo" | 32 |

:age | "Messi" | 32 |

:age | "Modric" | 35 |

:age | "Pique" | 32 |

:age | "Ramos" | 34 |

:age | "Umtiti" | 25 |

:age | "Varane" | 27 |

:name | "Busquets" | "Busquets" |

:name | "Carvajal" | "Carvajal" |

:name | "Cortois" | "Cortois" |

:name | "Dembele" | "Dembele" |

:name | "Griezmann" | "Griezmann" |

:name | "Kroos" | "Kroos" |

:name | "Marcelo" | "Marcelo" |

:name | "Messi" | "Messi" |

:name | "Modric" | "Modric" |

:name | "Pique" | "Pique" |

:name | "Ramos" | "Ramos" |

:name | "Umtiti" | "Umtiti" |

:name | "Varane" | "Varane" |

:nationality | "Busquets" | "Spain" |

:nationality | "Carvajal" | "Spain" |

:nationality | "Cortois" | "Belgium" |

:nationality | "Dembele" | "France" |

:nationality | "Griezmann" | "France" |

:nationality | "Kroos" | "Germany" |

:nationality | "Marcelo" | "Brazil" |

:nationality | "Messi" | "Argentina" |

:nationality | "Modric" | "Croatia" |

:nationality | "Pique" | "Spain" |

:nationality | "Ramos" | "Spain" |

:nationality | "Umtiti" | "France" |

:nationality | "Varane" | "France" |

:plays_for | "Busquets" | "BFC" |

:plays_for | "Carvajal" | "RM" |

:plays_for | "Cortois" | "RM" |

:plays_for | "Dembele" | "BFC" |

:plays_for | "Griezmann" | "BFC" |

:plays_for | "Kroos" | "RM" |

:plays_for | "Marcelo" | "RM" |

:plays_for | "Messi" | "BFC" |

:plays_for | "Modric" | "RM" |

:plays_for | "Pique" | "BFC" |

:plays_for | "Ramos" | "RM" |

:plays_for | "Umtiti" | "BFC" |

:plays_for | "Varane" | "RM" |

:salary | "Busquets" | 15 |

:salary | "Carvajal" | 7 |

:salary | "Cortois" | 7 |

:salary | "Dembele" | 12 |

:salary | "Griezmann" | 45 |

:salary | "Kroos" | 10 |

:salary | "Marcelo" | 7 |

:salary | "Messi" | 70 |

:salary | "Modric" | 10 |

:salary | "Pique" | 12 |

:salary | "Ramos" | 15 |

:salary | "Umtiti" | 12 |

:salary | "Varane" | 7 |

### The Last Argument

When writing a table, it is natural to put the *keys* first and the *values* last:
players and their age (as above); players and their salary; or graph edges and their weight.

Rel provides a number of common operations that operate on the last argument of a relation.
For example,
the standard library includes utilities for the
`max`

, `min`

, `sum`

, and `average`

of a relation,
taking the last argument of the relation as the value to be aggregated.
The basic salary stats for our installed `player`

relation
can be computed as follows:

`def output = { (:sum, sum[salary]);`

(:average, average[salary]);

(:count, count[salary]);

(:argmax, argmax[salary]) }

Relation: output

:argmax | "Messi" |

:average | 17.615384615384617 |

:count | 13 |

:sum | 229 |

Note that these two numbers are different:

The first, `count[salary]`

is the number of rows in the `salary`

relation.
The relation `x : salary(_, x)`

contains the *unique* values found in the second argument.
As different players have the same salary, the second number is smaller.
This is important to keep in mind when computing averages and sums, as we will see below in
the section on group-by and aggregations.

### Aggregating Over the Empty Relation

In most cases, aggregating over an empty set gives an empty set, rather than, say, 0.
For example, `count[x in name : salary[x] < 0]`

is `{}`

, rather than `{0}`

.
This is a design choice that simplifies the semantics of the language, and often results in more sparse intermediate data,
where the default (0) does not have to be represented.

If we want to include the default, we can use the *override* operator, `left_override`

, also known as `<++`

,
from the Rel Standard Library.

For example:

`def output = c in {"RM"; "BFC"; "Chelsea"} :`

sum[p where plays_for(p, c) : salary[p] ] <++ 0

Relation: output

"BFC" | 166 |

"Chelsea" | 0 |

"RM" | 63 |

Without `<++ 0`

, the row for “Chelsea” would not be included in the results.
(We discuss`<++`

further
in the
advanced syntax section
of this Rel primer.)

## Bindings First or Bindings Last

Many Rel expressions result in a relation that has value keys first, and one or more metrics that follow. Writing these expressions in Rel, we can have the bindings go first and the values (metrics) go last, or vice-versa, depending on what feels more natural.

Consider an example from Part I, where we use `:`

, so the bindings go first and the values go last:

`def mydomain = range[1, 5, 1]`

def output = x in mydomain, y in mydomain where x + y = 5 : x-y, x+y, x*y

Relation: output

1 | 4 | -3 | 5 | 4 |

2 | 3 | -1 | 5 | 6 |

3 | 2 | 1 | 5 | 6 |

4 | 1 | 3 | 5 | 4 |

We now show how you can move the bindings (which include the `where`

and `in`

constraints) to the other side of the `:`

.

`for`

, Also Known as `|`

The Rel construct `for`

can be used instead of `:`

to put the values first and the bindings last, which can sometimes
make things easier to read — for example, when the `where`

condition is itself a complex expression.
Note that the result is exactly the same.

`def mydomain = range[1, 5, 1]`

def output = x-y, x+y, x*y for x in mydomain, y in mydomain where x+y = 5

Relation: output

1 | 4 | -3 | 5 | 4 |

2 | 3 | -1 | 5 | 6 |

3 | 2 | 1 | 5 | 6 |

4 | 1 | 3 | 5 | 4 |

Even though they read differently, these two definitions are equivalent.
In both cases, the values of `x`

and `y`

will appear first in the result tuples.

Let’s go back to aggregations and say we want to compute the average salary of players under 30.

By choosing to use `:`

or `for`

,
Rel lets us put the condition first and the metric
(the value being aggregated, in this case, `salary`

) last,
or the metric first and the condition last.

Condition first, metric second, using `:`

:

Metric first, condition last, using `for`

(or its alias `|`

) :

In general, for an expression `Expr`

and bindings `b`

,
`Expr for b`

is equivalent to `b : Expr`

.
For a more mathematical notation, `Expr for b`

can also be written as `Expr | b`

. For example:

We can read `|`

as “such that”, remembering that the bound variables are included at the *beginning* of the resulting tuples
(which is what we want for correct aggregation results).

`b : e`

(bindings at the left) or `e for b`

(aka `e | b`

, bindings at the right)
to make sure you are not conflating keys that have the same metric value.
Use `from`

only if you want to quantify away keys, and possibly remove duplicate values as a result.`e | b`

is equivalent to `e for b`

, which is equivalent to `b: e`

.
In all three cases, the result keeps the tuples from `b`

, making it safe for aggregation.## Group-By

The `:`

operator lets us do group-by aggregations easily. For example,
to see the average age for each team:

`def output = x in team : average[p where plays_for(p, x): age[p] ]`

Relation: output

"BFC" | 28.333333333333332 |

"RM" | 30.571428571428573 |

(Technical footnote: the `in team`

clause is not really needed, since `plays_for`

constrains the values of x in the same way.)

If we prefer, we can write this relation as:

`def output[x in team] = average[p where plays_for(p, x): age[p] ]`

As this is more readable, we adopt this style below.

To see the average salary and count, grouped by age:

`def output[a] = average[p where age(p, a) : salary[p]], count[p : age(p, a)]`

Relation: output

22 | 12.0 | 1 |

25 | 12.0 | 1 |

27 | 7.0 | 1 |

28 | 19.666666666666668 | 3 |

30 | 10.0 | 1 |

31 | 15.0 | 1 |

32 | 29.666666666666668 | 3 |

34 | 15.0 | 1 |

35 | 10.0 | 1 |

Average salary and count, grouped by nationality:

`def output[n] = average[p where nationality(p, n) : salary[p]],`

count[p : nationality(p, n)]

Relation: output

"Argentina" | 70.0 | 1 |

"Belgium" | 7.0 | 1 |

"Brazil" | 7.0 | 1 |

"Croatia" | 10.0 | 1 |

"France" | 19.0 | 4 |

"Germany" | 10.0 | 1 |

"Spain" | 12.25 | 4 |

`for`

Keeps Variables, `from`

Does Not

When aggregating, we usually want to use `for`

(or its equivalent, `|`

). For example:

`def output:right = sum[salary[x] for x in name`

where plays_for(x, "RM") and age[x] < 30]

def output:wrong = sum[salary[x] from x in name

where plays_for(x, "RM") and age[x] < 30]

Relation: output

:right | 21 |

:wrong | 7 |

(Here we are using module notation to make `right`

and `wrong`

sub-relations of `output`

.
See the Rel Modules concept guide).

There are three players satisfying the condition, all with a salary of 7. In the first aggregation, we are taking the sum of this relation:

`def output = salary[x] for x in name where plays_for(x, "RM") and age[x] < 30`

Relation: output

"Carvajal" | 7 |

"Cortois" | 7 |

"Varane" | 7 |

In the second, since we are existentially quantifying away the player, we are just taking the sum of the relation `{7}`

.

`for`

to keep the variables, and `from`

to existentially quantify them away.## Joining Relations

In database parlance, a *join* combines columns from different tables to build a new one, based on common values in the rows of each table.
In Rel, this corresponds to defining new relations,
based on common values between the tuples in the joined relations.

It is simple to join relations in Rel: `and`

(or `,`

) will suffice. For example, to get a list of players with their team and nationality
we join the `plays_for`

and `nationality`

relations:

`def output(player, team, country) = plays_for(player, team) and`

nationality(player, country)

Relation: output

"Busquets" | "BFC" | "Spain" |

"Carvajal" | "RM" | "Spain" |

"Cortois" | "RM" | "Belgium" |

"Dembele" | "BFC" | "France" |

"Griezmann" | "BFC" | "France" |

"Kroos" | "RM" | "Germany" |

"Marcelo" | "RM" | "Brazil" |

"Messi" | "BFC" | "Argentina" |

"Modric" | "RM" | "Croatia" |

"Pique" | "BFC" | "Spain" |

"Ramos" | "RM" | "Spain" |

"Umtiti" | "BFC" | "France" |

"Varane" | "RM" | "France" |

We can use `from`

to existentially quantify away variables we do not want in the result
(see the following section).
For example, if we just want to see the nationalities playing for each team, we can write:

`def output(team, country) = plays_for(player, team) and`

nationality(player, country) from player

Relation: output

"BFC" | "Argentina" |

"BFC" | "France" |

"BFC" | "Spain" |

"RM" | "Belgium" |

"RM" | "Brazil" |

"RM" | "Croatia" |

"RM" | "France" |

"RM" | "Germany" |

"RM" | "Spain" |

## Projection (Existential Quantification)

When manipulating relations, we often want to remove one or more columns and keep the others.
We have already seen a special case of this,
the relational application operator
(`[]`

),
which additionally restricts the removed columns to have particular values.
In general, we can use `exists`

:

`def myrel = { (1, "a", 2); (1, "b", 3); (3, "a", 6); (4, "b", 12) }`

def output(y) = exists(x, z : myrel(x, y, z))

Relation: output

"a" |

"b" |

There are two other ways to indicate existential quantification: `_`

(underscore) and `from`

.

`def myrel = { (1, "a", 2); (1, "b", 3); (3, "a", 6); (4, "b", 12) }`

def output(x) = myrel(_, x, _)

Relation: output

"a" |

"b" |

We can think of `_`

as a “wildcard” variable, which will match anything. More than one `_`

can be used, unrelated to each other.

The `from`

construct also existentially quantifies away variables. This definition is equivalent to the previous one:

`def myrel = { (1, "a", 2); (1, "b", 3); (3, "a", 6); (4, "b", 12) }`

def output(y) = myrel(x, y, z) from x, z

Relation: output

"a" |

"b" |

## Universal Quantification

Rel also supports universal quantification, provided the variable quantified over has a finite domain:

In general, to restrict the domain being quantified over, use `in`

to restrict single variables,
and `where`

to restrict combinations of variables:

`def output = if`

forall(x in {1; 2; 3}, y in {4; 5; 6} where x + y < 10 : x + y < 8)

then "yes" else "no" end

Relation: output

"no" |

Following the rules of logic, if the domain is empty (`false`

), the result is always `true`

:

`def mydomain = {1; 2; 3; 4}`

def output = if

forall(x where mydomain(x) and x < 0 : 5 < 4)

then "yes" else "no" end

Relation: output

"yes" |

## Summary

This article has covered common database operations as expressed in Rel: aggregations, group-by aggregations, and joins. For more in our Rel Primer series, see Advanced Syntax.