Working With CSV Data

This concept guide explains how to work with CSV data in Rel ranging from the CSV representation and data import to querying, basic exploratory data analysis (EDA), and data manipulation.

Introduction

The CSV data format is widely used across database management systems. As you will see, Rel provides a number of functionalities that facilitate working with CSV data, including loading, manipulating, and visualizing them.

Note that this guide complements the CSV Import, CSV Export, and Accessing the Cloud guides, where you can find all the relevant information for importing and exporting CSV data. Rel’s CSV functionality is based on RFC 4180 (opens in a new tab).

CSV Data Schema

In this section, you will learn about the representation and schema of CSV data. As an example, consider the following CSV data:

country,drink
Spain,Gazpacho
Spain,Sangría
Argentina,"Yerba Mate"
United States,"Coca-Cola"
Italy,"San Pellegrino"

You can load this data from the cloud using the load_csv command. Here’s an example that loads the data from the cloud:

// read query
 
def cloud_csv = load_csv["azure://raidocs.blob.core.windows.net/csv-import/simple-drinks.csv"]
 
def output = ::std::display::table[cloud_csv]

To illustrate how data is represented in a relational form you can compare the data loaded from the cloud to manually loaded data. You can load the same data manually into the RKGS by creating a module as follows:

// read query
 
module my_csv
    def country = {
        (1, "Spain");
        (2, "Spain");
        (3, "Argentina");
        (4, "United States");
        (5, "Italy");
    }
 
    def drink = {
        (1, "Gazpacho");
        (2, "Sangria");
        (3, "Yerba Mate");
        (4, "Coca Cola");
        (5, "San Pellegrino");
    }
end
 
def output = ::std::display::table[my_csv]

The module above has two columns (namely, country and drink), and uses keys (i.e., the integers) to connect each drink to its country of origin. Note that the relation table provides a more concise output. See Tabular Form.

Note that the two data outputs are almost equivalent. The difference is that the manually defined module explicitly used an integer for the keys while load_csv automatically generated the integer value. This automatically generated value refers to the row number of each entry in the CSV file.

You can see this by exploring the loaded data but without using table:

// read query
 
def cloud_csv = load_csv["azure://raidocs.blob.core.windows.net/csv-import/simple-drinks.csv"]
 
def output = cloud_csv

The relation created by load_csv contains triples of the form (colname, filepos, value).

Rel stores data in a similar way, using Graph Normal Form (GNF). GNF is a fully normalized representation that leverages modularity and specialization.

Note that in the case of CSV data:

The variable colname refers to the CSV column names and is of type Symbol.
The variable filepos is the record identifier and is of type FilePos. It uniquely identifies the row of a given CSV file.
The variable v is the entry in the cell with column colname and row filepos.

CSV data are stored in a relational form, where each CSV column is essentially a separate relation with a well-defined type. The module — for example, my_csv — effectively acts as a container that represents the CSV data.

Once imported, the raw CSV data are schema-mapped and are further transformed to align with the desired data model. It’s advisable to use GNF to enhance modeling and achieve effective reasoning.

Import and Export

This section explains how to import and export a CSV dataset so you can later work with its data.

Dataset

This guide uses part of the car dataset (opens in a new tab) by Petra Isenberg, Pierre Dragicevic, and Yvonne Jansen. Start by loading the data so you can use them for different queries and other operations in the rest of the guide.

Importing the Data

The cars.csv dataset contains 406 car instances with nine attributes each.

The cars.csv file contains 408 rows because the first two rows are the column names and their types:

Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin
STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT

When loading the data, the first row is used for the column names. You will later remove the second row as you can’t directly use it within Rel. Instead, you can specify the data types for each column when loading the data. Finally, since the data are separated by semicolons, you have to specify this delimiter.

You can load the data in a base relation called my_data as follows:

// write query
 
// Set the import configuration.
module config
    def path = "azure://raidocs.blob.core.windows.net/datasets/cars/cars.csv"
 
    def schema = {
        (:Car, "string");
        (:MPG, "float");
        (:Cylinders, "int")
        (:Displacement, "float");
        (:Horsepower, "float");
        (:Weight, "float");
        (:Acceleration, "float");
        (:Model, "int");
        (:Origin, "string");
    }
 
    // Set the delimiter.
    def syntax:delim = ';'
end
 
def insert:my_data = load_csv[config]
 
def output = ::std::display::table[my_data]

For more details on the various parameters for loading CSV data, see CSV Import. Additionally, in certain cases you may come across errors when loading the data. These errors are documented in the Loading Errors section. Loading errors are also part of the module and need attention. Typically, if you want to store the data as a base relation it’s best to filter out the Loading Errors first.

Exporting the Data

In addition to importing, you can also export CSV data to the cloud. For this, you need to represent the data in the same way, i.e., in triples of the form (colname, filepos, value). You can export data to the cloud using the export_csv command. For more details on exporting CSV data, see CSV Export.

Queries

You can use Rel to query the data. In the next sections you will go over some example queries.

Filtering Columns

You can select which columns to display from the data. Here’s an example showing only the values from the Car, Acceleration and MPG columns:

// read query
 
def output = ::std::display::table[my_data[col] for col in {:Car; :Acceleration; :MPG}]

Filtering Rows

Filtering Within One Column

For the first example, say that you want to see the cars that were introduced in the 70’s:

// read query
 
def rows_to_keep(row) {
    exists(m :
        my_data:Model(row, m)
        and m = 70
    )
}
 
def output = ::std::display::table[my_data(col, row, val) and rows_to_keep(row) for col, row, val]

Filtering Within Multiple Columns

Next, say you’re interested in finding all the car names with acceleration 0 to 62 mph of 10 seconds or less. You can write this query as follows:

// read query
 
def output(car) {
    exists(row, acc:
        my_data:Car(row, car)
        and my_data:Acceleration(row, acc)
        and acc <= 10.0
    )
}

In Rel, relations are sets, so if you list the same car more than once in the dataset — for example, in two different countries — it would appear only once in the output.

Next, you can identify the displacement of all Toyota cars:

// read query
 
def output(car, disp) {
    exists(row:
        my_data:Car(row, car)
        and my_data:Displacement(row, disp)
        and like_match("Toyota\%", car)
    )
}

Essentially, each query that involves multiple columns is effectively a join because each CSV column is stored in the database as a separate relation.

Aggregation and Group By

Rel also supports aggregate queries. Here’s an example that displays the number of cars per origin:

// read query
 
def output[origin] = count[row : my_data:Origin(row, origin)]

You can also find the heaviest car:

// read query
 
def car_weight(car, weight) {
    exists(row:
        my_data:Car(row, car)
        and my_data:Weight(row, weight)
    )
}
 
def output = argmax[car_weight]

Point-Wise and Point-Free Syntax

You can use Rel to express your queries and data models in two different styles.

Point-Free Syntax

The first style is called point-free syntax. This style closely resembles the approach of popular data science languages, such as Python. Here, the variables are implied, i.e., you can omit the variables as long as there is no ambiguity in the model.

Here is an example of a point-free query that returns the brands of all cars:

def output = my_data:Car

Point-Wise Syntax

The second style is point-wise syntax, which is more similar to logical programming. In this case, all the variables are explicitly stated. Here’s an example of a point-wise query, also returning the brands of all cars:

def output(row, car) = my_data:Car(row, car)

Explicitly stating the variables, like in the the point-wise syntax, is often required when performing joins on a shared variable or adding additional conditions to a variable — for instance, see Filtering Within One Column. For more details on querying data, see My First Rel Program, Rel Primer: Basic Syntax, and Rel Primer: Advanced Syntax.

Data Exploration

The imported relation my_data is a Rel module. You will now explore the data using simple examples.

Tabular Form

One useful way to take a look at the data is through the lens of the table relation:

// read query
 
def output = ::std::display::table[my_data]

You can also select which columns to display in a tabular format. Here’s an example showing only the values from the Car and MPG columns:

// read query
 
def output = ::std::display::table[my_data[col] for col in {:Car; :MPG}]

This is the familiar tabular (or unstacked) format of the data. Essentially, table displays the GNF relation my_data as a (wide table](/rel/concepts/graph-normal-form#wide-versus-tall-tables).

Exploring the Data

You can start by examining some small snippets of the data. The following code displays the top five rows of the data:

// read query
 
@ondemand @outline
def top_rows[k, R](col, row, val) {
    exists(order:
        R(col, row, val)
        and sort[second[R]](order, row)
        and order <= k
    )
}
 
def output = ::std::display::table[top_rows[5, my_data]]

Note that you only see the top four data rows as one of the top data rows is a loading error that you will explore in Loading Errors.

Similarly, the following code displays the bottom five rows of the data:

// read query
 
@ondemand @outline
def bottom_rows[k, R](col, row, val) {
    exists(order:
        R(col, row, val)
        and reverse_sort[second[R]](order, row)
        and order <= k
    )
}
 
def output = ::std::display::table[bottom_rows[5, my_data]]

Considering the triples’ form, the first row of the data indicates that the value of the Acceleration column in row 3 of the imported CSV file is 12.0. Similarly, the Weight column in row 408 of the imported CSV file has the value 2720.0.

🔎

Simply using top and bottom may lead to unexpected results as these two commands sort the relation data in lexicographic order before displaying the top or bottom rows.

Finding Outliers

Rel allows you to explore specific columns of the data. For example, sort — and its counterpart reverse_sort — are higher-order relations that sort data. Combining sort with top and bottom allows for a simple data exploration that can help identify outliers.

Here are, for example, the bottom and top ten values of the Horsepower column in the my_data relation:

// read query
 
def hp = sort[hpval, row : my_data:Horsepower(row, hpval)]
 
def output(hpval, row) { bottom[10, hp](_, _, hpval, row) }

// read query
 
def hp = sort[hpval, row : my_data:Horsepower(row, hpval)]
 
def output(hpval, row) { top[10, hp](_, _, hpval, row) }

As you can see, there are six rows that have a value of 0 in the Horsepower column.

Data Cleaning

Loading Errors

In some cases, you may get loading errors when loading CSV data. For more details, see Import Errors in the CSV Import guide. Note that in most cases these errors refer to malformed data.

You can check for loading errors through the :load_errors field of the loaded relation my_data. More specifically:

// read query
 
def output = my_data:load_errors

In the loaded cars.csv file, the second row contains the data types of the data. This row, however, is not useful as you have already specified the data type of each column when importing the data. You can remove such error lines as follows:

// model
 
def my_data_no_errors(col, row, entry...) {
    my_data(col, row, entry...) and not my_data:load_errors(row, _, _)
}

// read query
 
def output = my_data_no_errors:load_errors

Note that now my_data_no_errors does not contain the load_errors that were present in my_data.

You can also check the header_row (opens in a new tab) option that provides an alternative way for removing unnecessary rows from the beginning of a CSV file.

Unwanted/Incorrect Values

You can consider that the 0 values in the Horsepower column of the CSV data are unwanted and should be removed.

You can perform data cleaning in a way similar to fixing the loading errors. More specifically, the following code identifies the rows that have a 0 in the Horsepower column and removes those rows from the data:

// model
 
def rows_to_remove(row) {
    exists(
        hp : my_data_no_errors:Horsepower(row, hp)
        and hp = 0.0
    )
}
 
def my_data_fix_hp(col, row, entry...) {
    my_data_no_errors(col, row, entry...)
    and not rows_to_remove(row)
}

// read query
 
def hp = sort[hpval, row : my_data_fix_hp:Horsepower(row, hpval)]
 
def output(hpval, row) { top[10, hp](_, _, hpval, row) }

The rows with horsepower equal to 0 have now been removed.

Note that there is a similar situation (i.e., rows with a 0 value) in the MPG column as well. You can keep those for now to use them later in Imputing Data.

Parameterized Data Cleaning

Data cleaning and preparation are often a large part of the workflow when working with CSV data. Rel allows you to group the cleaning operations and parameterize them using modules. Here’s an example that uses parameterized modules and performs the same two operations, i.e., removing loading errors and rows with 0 in the Horsepower column:

// model
 
@ondemand @outline
def clean[D, C](col, row, v) {
    D(col, row, v)
    and not C:remove_cols(col)
    and not C:remove_rows:rows_to_remove(row)
}
 
@inline
module clean_ops[D]
    module remove_cols
        def load_errors = true
    end
 
    module remove_rows
        def rows_to_remove(row) {
            exists(
                hp : D:Horsepower(row, hp)
                and hp = 0.0
            )
        }
    end
end
 
def my_data_clean = clean[my_data, clean_ops[my_data]]

// read query
 
def output = ::std::display::table[my_data_clean]

Data Manipulation

Views

You started this guide by importing data from the cloud in the my_data relation. In the case of my_data, the data are actually inserted into the database since you used the control relation insert when importing the data. A relation whose tuples are inserted into the database, such as my_data in this example, is called a base relation.

After creating the my_data relation, you used different operations to remove loading errors or clean some unwanted values from the data. One example of a relation that does this is the my_data_clean relation. This relation, however, operates as a view over the data, i.e., it does not actually store the changed data on disk. Instead, the RKGS remembers the definition of my_data_clean. In this case, my_data_clean is a derived relation.

For more details on the functionality and differences of these two kinds of relations, see Base and Derived Relations in Rel Primer: Basic Syntax.

When you want to remove or change the data from the system, you should do so in the base relation. In this case, all derived relations are automatically updated to reflect the change.

🔎

Note that adding or changing data in a derived relation won’t work since the new data will essentially be a union of the new data with the existing data in the derived relation. The logic for changing data in a derived relation should be installed or defined in a query. For base relations, changing data should be a write transaction.

For more details, see Working With Base Relations.

Adding a Column

You can use Rel to compute new columns over the data that may be useful for your analysis. As an example, the cars.csv file that you loaded earlier has two columns, Horsepower and Weight, but it doesn’t have the power-to-weight ratio (PWR) column that measures the total power of a vehicle relative to its total weight.

You can compute the power-to-weight ratio using a point-free approach as follows:

// write query
 
def my_data_clean:PWR[row] = my_data_clean:Horsepower[row] / my_data_clean:Weight[row]
 
def output = ::std::display::table[my_data_clean]

The output displays the columns in alphabetical order based on their names, which places the PWR column at the far right of the table.

Alternatively, you can use the point-wise syntax to compute the PWR column:

def my_data_clean:PWR(row, pwr) {
    exists(hp, w:
        my_data_clean:Horsepower(row, hp)
        and my_data_clean:Weight(row, w)
        and pwr = hp / w
    )
}

Deleting a Column

You can also delete a column in a similar way. For example, here’s how you can delete the newly made column PWR:

// write query
 
def delete:my_data_clean:PWR = my_data_clean:PWR
 
def output = ::std::display::table[my_data_clean]

As you can see, the PWR column has now been removed.

Adding a Row

You can add a row in the data by providing the values needed for all the different columns. Since the my_data relation is stored in Graph Normal Form, you also need to specify a value for the row number.

The following code adds a row with the row value of 409 in the base relation my_data. This is equivalent to appending a row at the end of the file:

// write query
 
def row_val = 409
 
module new_data
    def Acceleration = { (row_val, 4.9); }
    def Car = { (row_val, "Nissan GTR Skyline R34"); }
    def Cylinders = { (row_val, 6); }
    def Displacement = { (row_val, 2600.0); }
    def Horsepower = { (row_val, 276); }
    def MPG = { (row_val, 16.0); }
    def Model = { (row_val, 98); }
    def Origin = { (row_val, "Japan"); }
    def Weight = { (row_val, 439.2); }
end
 
def insert:my_data = new_data

// read query
 
def output = ::std::display::table[my_data_clean]

You can see the new row at the very end of the output.

Deleting a Row

Here is an example of how to delete a row:

// write query
 
def to_delete_row(row) {
    exists(car:
        my_data:Car(row, car)
        and like_match("Nissan GTR\%", car)
    )
}
 
def delete:my_data(col, row, val) {
    my_data(col, row, val)
    and to_delete_row(row)
}

// read query
 
def output = ::std::display::table[my_data_clean]

The row from the very end of the file has now been deleted.

Changing Values

In some cases, it’s useful to change the data. For example, consider the values for the Origin column in the data:

// read query
 
def output(origin) { my_data:Origin(_, origin) }

Now, say you’re interested in spelling out the US acronym so that it’s consistent with the other two origins, i.e., Europe and Japan.

You can change one single value by specifying the given row and setting the new value. Here’s an example that changes the origin in row 3 in the data to United States:

// write query
 
def delete:my_data:Origin[3] = my_data:Origin[3]
def insert:my_data:Origin[3] = "United States"

🔎

Note that because my_data_clean is a derived relation, you need to update the base relation my_data. The changes propagate to my_data_clean. For more details, see Working With Base Relations.

// read query
 
def output = ::std::display::table[my_data_clean[col] for col in {:Car; :Origin;}]

You can now see that the origin value in row 3 has changed to United States. In Rel, you can also change multiple values at once.

Here’s an example that changes all the remaining US values in the Origin column to United States:

// write query
 
def insert:my_data:Origin(row, orig) {
    my_data:Origin(row, "US")
    and orig = "United States"
}

// write query
 
def delete:my_data:Origin(row, orig) {
    my_data:Origin(row, orig)
    and orig = "US"
}

Note that the delete needs to happen after the insert and thus you need to perform two separate transactions.

The data in the relation my_data_clean have now been changed accordingly. The next query displays all different origins in the data:

// read query
 
def output(origin) = my_data_clean:Origin(_, origin)

Imputing Data

Rel also provides the functionality for performing data imputation. For this example, consider again the 0 values that were discussed earlier in the MPG column:

// read query
 
def output(row, mpgval) {
    my_data_clean:MPG(row, mpgval)
    and mpgval = 0.0
}

As these values are most likely incorrect, you can use Rel to impute the data. There are several approaches to this. One way is to replace all 0 MPG with the overall average MPG value. Here’s the code that does this:

// write query
 
// Get non-zero `mpgs`.
def nz_mpgs(row, mpgval) {
    my_data_clean:MPG(row, mpgval)
    and mpgval != 0.0
}
 
// Compute the average.
def nz_mpgs_avg = mean[nz_mpgs]
 
// Impute.
def insert:my_data:MPG(row, val) {
    my_data:MPG(row, 0.0)
    and val = nz_mpgs_avg
}

// write query
 
def delete:my_data:MPG(row, val) {
    my_data:MPG(row, val)
    and val = 0.0
}

// read query
 
def output(row, mpgval) {
    my_data_clean:MPG(row, mpgval)
    and mpgval = 0.0
}

There are now no 0 MPG values in the data.

In this section, you imputed certain values, i.e., you replaced certain values with another computed number. You can follow a similar approach for values that are missing. For more details, see Missing Values in the Graph Normal Form concept guide.

Data Visualization

Rel allows you to visualize data through the Vega-Lite (opens in a new tab) library. You can plot different views of your data using the various visualization capabilities.

For example, you can plot a histogram of the number of cars in each Origin. The data contain cars from three different origins, namely Europe, Japan, and US.

The Vega-Lite library requires the data in a slightly different form. First, the data need to be numbered consecutively. You can do this using the lined_csv functionality. Second, it requires the data to be in array format. Here’s the model to perform these operations:

// model
 
def my_data_lined = lined_csv[my_data_clean]
def my_data_graph[:[], i, col] = my_data_lined[col, i]

You can now plot a histogram of the number of cars per origin:

// read query
 
// Assign the data.
def chart:data:values = my_data_graph
 
// Set up the chart.
def chart:mark:type = "bar"
def chart:mark:tooltip = boolean_true
 
def chart = vegalite_utils:x[{
  (:field, "Origin");
  (:title, "Origin");
  (:type, "ordinal");
  (:axis, {
    (:labelAngle, 45);
    (:ticks, boolean_true);
    (:grid, boolean_true);
  })
}]
 
def chart = vegalite_utils:y[{
  (:aggregate, "count");
  (:type, "quantitative");
}]
 
// Display.
def output = ::std::display::vegalite::plot[chart]

Similarly, you can check the correlation between horsepower and acceleration by creating a scatter plot using Vega-Lite. You can additionally color the different points in the plot using different colors per origin. Here’s the code to create this chart:

// read query
 
// Assign the data.
def chart:data:values = my_data_graph
 
// Set up the chart.
def chart:mark = "point"
 
def chart = vegalite_utils:x[{
  (:field, "Horsepower");
  (:title, "Horsepower (hp)");
  (:type, "quantitative");
  (:scale, :zero, boolean_false);
}]
 
def chart = vegalite_utils:y[{
  (:field, "Acceleration");
  (:title, "Acceleration 0-60mph (sec)");
  (:type, "quantitative");
  (:scale, :zero, boolean_false);
}]
 
def chart = vegalite_utils:color[{
  (:field, "Origin");
  (:type, "nominal");
  (:scale, :domain, :[], {(1, "United States"); (2, "Japan"); (3, "Europe")});
  (:title, "Origin");
 
}]
 
// Display.
def output = ::std::display::vegalite::plot[chart]

Summary

You have learned how to work with CSV data using Rel by loading, exploring, and manipulating data and their schema. Using the examples in this guide, you are now able to import various forms of CSV data into your RAI database and work with the data. See also the CSV Import, CSV Export, and JSON Import guides for more information.

Was this doc helpful?