Skip to content
  • Rel
  • REL PRIMER
  • Aggregation, Group-by, and Joins

Rel Primer: Aggregations, Group-by, and Joins

Explaining Aggregations, Group-by, and Joins in Rel

Introduction

This Rel Primer section focuses on common database operations in Rel: aggregations, group-by aggregations, and joins.

See Basic Syntax for an introduction to the basic syntax of Rel, a pre-requisite for this Rel primer section.

Aggregations

To demonstrate aggregations, we will import some basic information about soccer players (from Spring 2021), store this data in a base relation (player_csv), and install a few derived relations, making them available for future use.

First, we import our data and save it in a base relation:

๐Ÿ‘‡

update

// update
 
def config:data = """
    name,salary,age,plays_for,nationality
    Messi,70,32,BFC,Argentina
    Griezmann,45,28,BFC,France
    Busquets,15,31,BFC,Spain
    Pique,12,32,BFC,Spain
    Dembele,12,22,BFC,France
    Umtiti,12,25,BFC,France
    Cortois,7,28,RM,Belgium
    Carvajal,7,28,RM,Spain
    Ramos,15,34,RM,Spain
    Varane,7,27,RM,France
    Marcelo,7,32,RM,Brazil
    Kroos,10,30,RM,Germany
    Modric,10,35,RM,Croatia
    """
def config:schema = {
    :name, "string";
    :salary, "int";
    :age, "int";
    :plays_for, "string";
    :nationality, "string"
    }
def csv = load_csv[config]
def delete[:player_csv] = player_csv
def insert[:player_csv] = csv
def output = player_csv

In practice, config:data would be a file uploaded to a notebook as a string relation, or config:path would be specified to point to a CSV file location. See the CSV Import How-To Guide for details.

Now we install some derived relations:

๐Ÿ‘‡

install

// install
 
def player(prop, name, val) =
    player_csv(prop, row, val)
    and player_csv(:name, row, name) from row
def name(x) = player(_, x, _)
def salary = player:salary
def age = player:age
def plays_for = player:plays_for
def nationality = player:nationality
def team(t) = player:plays_for(_, t) // set of teams

Once installed, these definitions are available for querying (see Installing Models). (For an explanation of _, which helps collect the set of teams in the team relation, see the Projection section below. For an explanation of from, see below. Expressions like player:salary use the Rel module syntax.)

๐Ÿ”Ž

player_csv is a base relation, created by the update transaction that loaded the CSV data and inserted it into player_csv, above. The relations player, name, salary, etc. are derived relations, sometimes known as views, based on the base relation data. The install transaction above makes them available to subsequent queries.

Since player is now installed, we can query it:

// query
 
def output = player
Loading player...

The Last Argument

When writing a table, it is natural to put the keys first and the values last: players and their age (as above); players and their salary; or graph edges and their weight.

Rel provides a number of common operations that operate on the last argument of a relation. For example, the standard library includes utilities for the max, min, sum, and average of a relation, taking the last argument of the relation as the value to be aggregated. The basic salary stats for our installed player relation can be computed as follows:

// query
 
def output = { (:sum, sum[salary]);
               (:average, average[salary]);
               (:count, count[salary]);
               (:argmax, argmax[salary]) }
 
Loading basic-salary-stats...

Note that these two numbers are different:

// query
 
def output = count[salary], count[x : salary(_, x)]
Loading two-numbers-are-different...

The first, count[salary] is the number of rows in the salary relation. The relation x : salary(_, x) contains the unique values found in the second argument. As different players have the same salary, the second number is smaller. This is important to keep in mind when computing averages and sums, as we will see below in the section on group-by and aggregations.

๐Ÿ”Ž

When computing aggregations, make sure you include keys, to avoid conflating equal values.

Aggregating Over the Empty Relation

In most cases, aggregating over an empty set gives an empty set, rather than, say, 0. For example, count[x in name : salary[x] < 0] is {}, rather than {0}. This is a design choice that simplifies the semantics of the language, and often results in more sparse intermediate data, where the default (0) does not have to be represented.

If we want to include the default, we can use the override operator, left_override, also known as <++, from the Rel Standard Library.

For example:

// query
 
def output = c in {"RM"; "BFC"; "Chelsea"} :
             sum[p where plays_for(p, c) : salary[p] ] <++ 0
Loading left-override-example...

Without <++ 0, the row for โ€œChelseaโ€ would not be included in the results. For more on <++, see the Advanced Syntax section of this Rel primer.

Bindings First or Bindings Last

Many Rel expressions result in a relation that has value keys first, and one or more metrics that follow. When writing these expressions in a Rel query or model, you can choose to have the bindings go first and the values (metrics) go last, or vice-versa, depending on what feels more natural. This will not change the result, as shown below.

Consider an example from Part I, which uses :, so the bindings go first and the values go last:

// query
 
def mydomain = range[1, 5, 1]
def output = x in mydomain, y in mydomain where x + y = 5 : x-y, x+y, x*y
Loading bindings-first-values-last...

You can move the bindings, which include the where and in constraints, to the other side of the : and still have an equivalent expression. The following section shows how to do this.

for, Also Known As |

The Rel construct for can be used instead of : to put the values first and the bindings last, which sometimes makes things easier to read โ€” for example, when the where condition is itself a complex expression. Note that the result is exactly the same.

// query
 
def mydomain = range[1, 5, 1]
def output = x-y, x+y, x*y for x in mydomain, y in mydomain where x+y = 5
Loading for-where...

Even though they read differently, these two definitions are equivalent. In both cases, the values of x and y will appear first in the result tuples.

Going back to aggregations, suppose you want to compute the average salary of players under 30.

By choosing to use : or for, Rel lets you put the condition first and the metric โ€” the value being aggregated, in this case, salary โ€” last, or the metric first and the condition last.

Condition first, metric second, using : :

// query
 
def output = average[x in name where age[x] < 30 : salary[x] ]
Loading condition-first-metric-second...

Metric first, condition last, using for (or its alias |) :

// query
 
def output = average[salary[x] for x in name where age[x] < 30]
Loading metric-first-condition-last...

The results will be the same.

In general, for an expression Expr and bindings b, Expr for b is equivalent to b : Expr . For a more mathematical notation, Expr for b can also be written as Expr | b. For example:

// query
 
def output = 100 * (x + y) | x in {1;2}, y in {1;3} where x + y = 3
Loading for-a-more-mathematical-notation...

You can read | as โ€œsuch thatโ€, remembering that the bound variables are included at the beginning of the resulting tuples, which is what you want for correct aggregation results.

๐Ÿ”Ž

In aggregations, use b : e (bindings at the left) or e for b (or e | b, bindings at the right) to make sure you are not conflating keys that have the same metric value. Use from only if you want to quantify away keys, and possibly remove duplicate values as a result.

๐Ÿ”Ž

Summary: e | b is equivalent to e for b, which is equivalent to b: e . In all three cases, the result keeps the tuples from b, making it safe for aggregation.

Group-By

The : operator lets you do group-by aggregations easily. For example, to see the average age for each team:

// query
 
def output = x in team : average[p where plays_for(p, x): age[p] ]
Loading group-by...

Note: The in team clause is not really needed, since plays_for constrains the values of x in the same way.

If you prefer, you can write this relation as:

def output[x in team] = average[p where plays_for(p, x): age[p] ]

As this is more readable, we adopt this style below.

To see the average salary and count, grouped by age:

// query
 
def output[a] = average[p where age(p, a) : salary[p]], count[p : age(p, a)]
Loading avg-salary-and-count-grouped-by-age...

Average salary and count, grouped by nationality:

// query
 
def output[n] = average[p where nationality(p, n) : salary[p]],
                count[p : nationality(p, n)]
Loading grouped-by-nationality...

for Keeps Variables, from Does Not

When aggregating, you will usually need for (or its equivalent, |). For example:

// query
 
def output:right = sum[salary[x] for x in name
                       where plays_for(x, "RM") and age[x] < 30]
def output:wrong = sum[salary[x] from x in name
                       where plays_for(x, "RM") and age[x] < 30]
Loading usually-want-to-use-for...

Module notation is used here to make right and wrong subrelations of output. See the Rel Modules concept guide.

There are three players satisfying the condition, all with a salary of 7. The first aggregation takes the sum of this relation:

// query
 
def output = salary[x] for x in name where plays_for(x, "RM") and age[x] < 30
Loading sum-of-this-relation...

The second aggregation, which existentially quantifies away the player, just takes the sum of the relation {7}.

๐Ÿ”Ž

Use for to keep the variables and from to existentially quantify them away.

Joining Relations

In database parlance, a join combines columns from different tables to build a new one, based on common elements in the rows of each table. In Rel, this corresponds to defining new relations, based on common elements between the tuples in the joined relations.

It is simple to join relations in Rel: and (or ,) will suffice.

For example, to get a list of players with their team and nationality, you can join the plays_for and nationality relations:

// query
 
def output(player, team, country) = plays_for(player, team) and
                                    nationality(player, country)
Loading join-plays-and-nationality-relations...

We can use from to existentially quantify away variables we do not want in the result (see the following section). For example, if we just want to see the nationalities playing for each team, we can write:

// query
 
def output(team, country) = plays_for(player, team) and
                            nationality(player, country) from player
Loading nationalities-playing-for-each-team...

Projection (Existential Quantification)

When manipulating relations, we often want to remove one or more columns and keep the others. We have already seen a special case of this, the partial relational application operator ([]), which additionally restricts the removed columns to have particular values. In general, we can use exists:

// query
 
def myrel = { (1, "a", 2); (1, "b", 3); (3, "a", 6); (4, "b", 12) }
def output(y) = exists(x, z : myrel(x, y, z))
Loading we-can-use-exists...

There are two other ways to indicate existential quantification: _ (underscore) and from.

// query
 
def myrel = { (1, "a", 2); (1, "b", 3); (3, "a", 6); (4, "b", 12) }
def output(x) = myrel(_, x, _)
Loading indicate-existential-quantification...

We can think of _ as a โ€œwildcardโ€ variable, which will match anything. More than one _ can be used, unrelated to each other.

The from construct also existentially quantifies away variables. This definition is equivalent to the previous one:

// query
 
def myrel = { (1, "a", 2); (1, "b", 3); (3, "a", 6); (4, "b", 12) }
def output(y) = myrel(x, y, z) from x, z
Loading existentially-quantifies-away...

Universal Quantification

Rel also supports universal quantification, provided the variable quantified over has a finite domain:

// query
 
def output("yes") = forall(x in {4;5;6} : x < 10)
Loading finite-domain...

In general, to restrict the domain being quantified over, use in to restrict single variables, and where to restrict combinations of variables:

// query
 
def output = if
     forall(x in {1; 2; 3}, y in {4; 5; 6} where x + y < 10 : x + y < 8)
     then "yes" else "no" end
Loading restrict-combinations-of-variables...

Following the rules of logic, if the domain is empty (false), the result is always true:

// query
 
def mydomain = {1; 2; 3; 4}
def output = if
    forall(x where mydomain(x) and x < 0 : 5 < 4)
    then "yes" else "no" end
Loading result-is-always-true...

Summary

This article has covered common database operations as expressed in Rel: aggregations, group-by aggregations, and joins. For more in this Rel Primer series, see Advanced Syntax.

Was this doc helpful?