Specialization

Rel allows you to specialize your relation to some of the values contained therein. These values become a part of the schema, rather than a part of the data. See Specialized Relations for more details.

In order to specialize a relation to some values, you must indicate these values. This is often referred to as specializing the values.

There are four principal ways to specialize a value:

You may use the operator #. For example, #17 is the specialized integer 17.
If the specialized value is a string, you may also use a Symbol. For example, both :name and :"name" are equivalent to #("name").
If you define a relation or a submodule inside some module declaration M, then the name of what you define is automatically converted to a Symbol. The module M — which is a relation — becomes specialized on the strings corresponding to these Symbols.
If you use load_csv to load data without specifying a schema in the configuration, then the names of columns in the first row are automatically converted to Symbols, and the resulting relation is specialized to the corresponding strings.
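
As a rough illustration of the third way, the sketch below declares a small module. The names person, name, and age are invented for this example and do not appear elsewhere in this section.

```rel
// read query

// Hypothetical module: person, name, and age are illustrative names only.
module person
    def name = "John"
    def age = 10
end

// The names defined inside the module act as Symbols,
// so :name can be used to pick out one part of person.
def output = person:name
```

If the module behaves as described above, person is specialized on the strings "name" and "age", and person:name selects the part that holds "John".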

A specialized value is different from its nonspecialized form. In particular, the two are not equal: #17 != 17. You can compare specialized numbers for equality or inequality, but you cannot use other comparisons or arithmetic. For instance, the expression #17 = #17 evaluates to true, but #17 <= #17, #17 >= #17, and #10 + #7 are all invalid expressions that evaluate to false.

A specialized value can be converted to the original value using the Library relation despecialize. For example, despecialize[:name] yields "name". You can simulate operations on specialized numbers by using despecialize:

// read query
 
def the_sum = despecialize[#10] + despecialize[#7]
def output = #(the_sum)
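
The same approach should carry over to comparisons, which are not available on specialized numbers directly. The following is a minimal sketch rather than a definitive recipe; the variable name x is arbitrary.

```rel
// read query

// x is bound to the plain integer obtained from #17,
// so the ordinary comparison <= can be applied to it.
def output(x) {
    x = despecialize[#17] and x <= 17
}
```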

Specialized Relations

If a relation contains specialized values, then the relation itself is specialized to those values. This means that the relation is actually represented internally as a number of different sets of tuples, a different one for each of the specialized values. You can think of such a relation as being partitioned into a number of smaller relations. The specialized values themselves are not represented as data, but form a part of the schemata of the different sets.

In the frequent case of specialized strings (Symbols), the name of the partition can be combined with the name of the relation. For instance, if the relation employees is specialized into two partitions for strings that describe two departments, "sales" and "marketing", then you can use employees:sales or employees:marketing as if they named separate relations. See Symbols (RelNames) for more details.
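
The employees relation described above might be sketched as follows. The employee numbers and names are invented for illustration.

```rel
// read query

// Hypothetical data: the ids and names are made up for this sketch.
def employees {
    :sales,     1, "Ann";
    :sales,     2, "Bob";
    :marketing, 3, "Carol"
}

// employees:sales refers to the partition for the Symbol :sales
def output = employees:sales
```

Under these assumptions, the output is expected to contain only the tuples for employees 1 and 2, without the department in front.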

The remainder of this section contains an extended example to clarify this.

Consider the following:

// read query
 
def Data1 {
    "name", 1, "John";
    "age",  1, 10;
    "town", 1, "NYC";
    "name", 2, "Mary";
    "age",  2, 11;
    "town", 2, "Duluth"
}
def output = Data1

Compare the above with this:

// read query
 
def Data2 {
    :name, 1, "John";
    :age,  1, 10;
    :town, 1, "NYC";
    :name, 2, "Mary";
    :age,  2, 11;
    :town, 2, "Duluth"
}
def output = Data2

At first sight, the only difference is in the first column. The first output has strings, whereas the second has Symbols. But if you run these examples in the Console and choose to view the “physical” form of the output, you can see the real difference.

In the first example, with strings in the first column, there are actually two relations of arity 3. One of them stores the ages, and its schema is shown as /String/Int64/Int64. The other stores the names and towns, and its schema is shown as /String/Int64/String.

In the second example, with Symbols, there are actually three relations of arity 2, whose schemata are /:age/Int64/Int64, /:name/Int64/String, and /:town/Int64/String.

If the data contained information about, say, 10,000 people, then the first example would require space for 90,000 tuple elements (30,000 tuples of three elements each), while the second would require space for only 60,000 tuple elements (30,000 tuples of two elements each). The savings come from representing the equivalents of "name", "age", and "town" only once, in the schemata, and not 10,000 times.

The evaluation of some queries can also be more efficient. If you want the list of all names, then this is obtained from Data2 just by making a projection, that is, picking the information from the second column. With Data1, by contrast, the projection would have to be preceded by a selection operation that picks out tuples whose first element is "name".
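
The two ways of obtaining the names can be sketched side by side. The relations repeat the definitions of Data1 and Data2 given earlier; the output names from_data1 and from_data2 are chosen only for this illustration.

```rel
// read query

def Data1 {
    "name", 1, "John";
    "age",  1, 10;
    "town", 1, "NYC";
    "name", 2, "Mary";
    "age",  2, 11;
    "town", 2, "Duluth"
}

def Data2 {
    :name, 1, "John";
    :age,  1, 10;
    :town, 1, "NYC";
    :name, 2, "Mary";
    :age,  2, 11;
    :town, 2, "Duluth"
}

// Data1: select the tuples whose first element is "name",
// then project onto the last column.
def output:from_data1 = Data1["name", _]

// Data2: the names already form a partition of their own,
// so only the projection onto the last column is needed.
def output:from_data2 = Data2:name[_]
```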

Note that the second example above is very similar to what you obtain with:

// read query
 
def config:data = """
name,age,town
John,10,NYC
Mary,11,Duluth
"""
def output = load_csv[config]

There are two differences. First, the ages are strings. Second, the numbers that distinguish the data for different persons are somewhat different, and their type is FilePos, not Int64.

The important point is that the resulting relation is partitioned into a separate relation for each of the columns.

So you can specify the schema of your relation within your CSV data, which gives you great flexibility.

You can even dynamically build a new schema in your model, depending on your data. This is akin to the metaprogramming facility found in some programming languages, and is explored in the example below.

Suppose you have a large amount of data in the name, age, town format shown above. The people described in the data have only a limited number of different ages — they might be the students of middle schools, say. For your application it is important to quickly access data of people with a particular age, so you want the large relation to consist of different partitions for different ages. You also do not want partitions that are empty.

You want to be able to issue a query such as def output = data[#11, :name], knowing that it will access only the partition for age 11, specifically the subpartition that contains only tuples with file positions and names. Rel can evaluate a query like this without performing a selection operation that filters away unwanted tuples.

It is relatively straightforward to dynamically set up a schema that will satisfy these requirements. For testing the logic during development you would use only small data, as shown below:

// read query
 
def config:data = """
name,age,town
John,10,NYC
Mary,11,Duluth
Ann,10,LA
Jack,11,Carbondale
"""
def csv = load_csv[config]
// The ages as integers associated with corresponding strings
def ages_int_str(ageInt, ageStr) {
   ageInt = parse_int[ageStr] and ageStr = csv:age[_]
}
// The ages as integers (needed for specialization)
def ages_int = first[ages_int_str]
// The specialized ages
def ages_spec = #(ages_int)
// The specialized integer ages associated with file positions
def age_spec_to_pos(ageSpec, filePos) {
   ages_spec(ageSpec) and csv:age(filePos, ages_int_str[despecialize[ageSpec]])
}
// The reorganized data, partitioned primarily by ages, then by names/towns
def data(ageSpec, tag, filePos, value) {
   age_spec_to_pos(ageSpec, filePos) and csv(tag, filePos, value) and {:name; :town}(tag)
}
 
def output = data

In the physical view you can see that data has four partitions, as expected: one for each combination of the two ages (#10 and #11) and the two tags (:name and :town).

The example above takes advantage of Library operations parse_int, first, and despecialize. The relation ages_int had to be defined because the specialization operator # can be used only on expressions of limited forms.

Having dynamically created the schema, you can query it. For example, you might want to find out the set of specialized ages that determine the partitions in your data:

def output = first[data]
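
The query mentioned earlier, data[#11, :name], can also be tried in isolation on a small hand-written stand-in for data. In this sketch the relation is called sample_data and is written out as literal tuples, with plain integers standing in for the file positions that load_csv would produce. Both choices are assumptions made only to keep the example short.

```rel
// read query

// Hand-written stand-in for the data relation built above.
// Plain integers replace the FilePos values that load_csv would produce.
def sample_data {
    #10, :name, 1, "John";
    #10, :town, 1, "NYC";
    #11, :name, 2, "Mary";
    #11, :town, 2, "Duluth"
}

// Only the partition for age 11 and tag :name is relevant to this query.
def output = sample_data[#11, :name]
```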

Using the Specialization Operator

Not every expression can be specialized by applying the # operator.

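The expressions that can be specialized with # are literals, simple constant expressions, and unary relations. Each of these is covered in its own section below.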

There are some caveats that will be discussed below.

An application of the operator # to a variable or to a specialized value will evaluate to false.

Specializing Literals

The specializing operator # can be applied to a literal of any type. The literal must be enclosed in parentheses, unless it is an integer literal.

For example:

// read query
 
def output = #17
def output = #(0)
def output = #(3.14)
def output = #(decimal[128, 20, sqrt[2]])
def output = #("Hello!")

As always, the physical view allows you to verify that the literals have been specialized.

Specializing Constant Expressions

You can apply the specializing operator # to a simple expression that can be evaluated to a constant at compile time.

For example:

// read query
 
def seventeen = 17
 
def output = #(11 + 6)
def output = #(decimal[128, 20, sqrt[2]])
def output = #("number %(17) is fine")
def output = #("another %(seventeen)")

However, you will get errors if the expressions involve relations that are not unary singletons, as in the following examples:

def many = 1; 2
def output = #("many %(many)")
def output = #(10 + many)

To overcome this limitation, see the end of Specializing Unary Relations.

Specializing Unary Relations

You can specialize all the values in a unary relation with one operation:

// read query
 
entity type Ent = Int, String
 
value type Point = Float, Float, Float
 
def R = 1, "test";
        2, decimal[128, 20, sqrt[2]];
        3, 2021-10-12T01:22:31+10:00;
        4, :relname;
        5, boolean_true;
        6, ^Ent[7, "HO!"];
        7, ^Point[1.0, 2.1, 3.2]
def output = #(R[_])

In the output, the long sequence of digits is the hash value that represents the entity. Notice that specialization of a specialized value produces no result, so :relname is silently dropped.

In the example above, the unary relation was obtained by projecting away the first column of a binary relation. However, if you try to achieve a similar effect by writing #(last[R]) or #(first[transpose[R]]), the result will be empty and you will get an error message.

Note that you cannot specialize a variable. If the definition of output above took the seemingly equivalent form shown below, there would be no output:

def output = y : R(_, x) and y = #(x) from x

Finally, if you want to specialize values of expressions that cannot be specialized immediately, such as the ones at the end of Specializing Constant Expressions, you can specialize them after first storing them in a unary relation:

// read query
 
def many = 1; 2
def tmp = "many %(many)"
def tmp = 10 + many
def output = #(tmp)
