Working With CSV Data
This concept guide explains how to work with CSV data in Rel ranging from the CSV representation and data import to querying, basic exploratory data analysis (EDA), and data manipulation.
Introduction
The CSV data format is widely used across database management systems. As you will see, Rel provides a number of functionalities that facilitate working with CSV data, including loading, manipulating, and visualizing them.
Note that this guide complements the CSV Import, CSV Export, and Accessing the Cloud guides, where you can find all the relevant information for importing and exporting CSV data. Rel’s CSV functionality is based on RFC 4180 (opens in a new tab).
CSV Data Schema
In this section, you will learn about the representation and schema of CSV data. As an example, consider the following CSV data:
country,drink
Spain,Gazpacho
Spain,Sangría
Argentina,"Yerba Mate"
United States,"Coca-Cola"
Italy,"San Pellegrino"
You can load this data from the cloud using the load_csv
command.
Here’s an example that loads the data from the cloud:
// read query
def cloud_csv = load_csv["azure://raidocs.blob.core.windows.net/csv-import/simple-drinks.csv"]
def output = ::std::display::table[cloud_csv]
To illustrate how data is represented in a relational form you can compare the data loaded from the cloud to manually loaded data. You can load the same data manually into the RKGS by creating a module as follows:
// read query
module my_csv
def country = {
(1, "Spain");
(2, "Spain");
(3, "Argentina");
(4, "United States");
(5, "Italy");
}
def drink = {
(1, "Gazpacho");
(2, "Sangria");
(3, "Yerba Mate");
(4, "Coca Cola");
(5, "San Pellegrino");
}
end
def output = ::std::display::table[my_csv]
The module above has two columns (namely, country
and drink
), and uses keys (i.e., the integers) to connect each drink to its country of origin.
Note that the relation table
provides a more concise output.
See Tabular Form.
Note that the two data outputs are almost equivalent.
The difference is that the manually defined module explicitly used an integer for the keys while load_csv
automatically generated the integer value.
This automatically generated value refers to the row number of each entry in the CSV file.
You can see this by exploring the loaded data but without using table
:
// read query
def cloud_csv = load_csv["azure://raidocs.blob.core.windows.net/csv-import/simple-drinks.csv"]
def output = cloud_csv
The relation created by load_csv
contains triples of the form (colname, filepos, value)
.
Rel stores data in a similar way, using Graph Normal Form (GNF). GNF is a fully normalized representation that leverages modularity and specialization.
Note that in the case of CSV data:
- The variable
colname
refers to the CSV column names and is of typeSymbol
. - The variable
filepos
is the record identifier and is of typeFilePos
. It uniquely identifies the row of a given CSV file. - The variable
v
is the entry in the cell with columncolname
and rowfilepos
.
CSV data are stored in a relational form, where each CSV column is essentially a separate relation with a well-defined type.
The module — for example, my_csv
— effectively acts as a container that represents the CSV data.
Once imported, the raw CSV data are schema-mapped and are further transformed to align with the desired data model. It’s advisable to use GNF to enhance modeling and achieve effective reasoning.
Import and Export
This section explains how to import and export a CSV dataset so you can later work with its data.
Dataset
This guide uses part of the car dataset (opens in a new tab) by Petra Isenberg, Pierre Dragicevic, and Yvonne Jansen. Start by loading the data so you can use them for different queries and other operations in the rest of the guide.
Importing the Data
The cars.csv
dataset contains 406 car instances with nine attributes each.
The cars.csv
file contains 408 rows because the first two rows are the column names and their types:
Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin
STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT
When loading the data, the first row is used for the column names. You will later remove the second row as you can’t directly use it within Rel. Instead, you can specify the data types for each column when loading the data. Finally, since the data are separated by semicolons, you have to specify this delimiter.
You can load the data in a base relation called my_data
as follows:
// write query
// Set the import configuration.
module config
def path = "azure://raidocs.blob.core.windows.net/datasets/cars/cars.csv"
def schema = {
(:Car, "string");
(:MPG, "float");
(:Cylinders, "int")
(:Displacement, "float");
(:Horsepower, "float");
(:Weight, "float");
(:Acceleration, "float");
(:Model, "int");
(:Origin, "string");
}
// Set the delimiter.
def syntax:delim = ';'
end
def insert:my_data = load_csv[config]
def output = ::std::display::table[my_data]
For more details on the various parameters for loading CSV data, see CSV Import. Additionally, in certain cases you may come across errors when loading the data. These errors are documented in the Loading Errors section. Loading errors are also part of the module and need attention. Typically, if you want to store the data as a base relation it’s best to filter out the Loading Errors first.
Exporting the Data
In addition to importing, you can also export CSV data to the cloud.
For this, you need to represent the data in the same way, i.e., in triples of the form (colname, filepos, value)
.
You can export data to the cloud using the
export_csv
command.
For more details on exporting CSV data, see CSV Export.
Queries
You can use Rel to query the data. In the next sections you will go over some example queries.
Filtering Columns
You can select which columns to display from the data.
Here’s an example showing only the values from the Car
, Acceleration
and MPG
columns:
// read query
def output = ::std::display::table[my_data[col] for col in {:Car; :Acceleration; :MPG}]
Filtering Rows
Filtering Within One Column
For the first example, say that you want to see the cars that were introduced in the 70’s:
// read query
def rows_to_keep(row) {
exists(m :
my_data:Model(row, m)
and m = 70
)
}
def output = ::std::display::table[my_data(col, row, val) and rows_to_keep(row) for col, row, val]
Filtering Within Multiple Columns
Next, say you’re interested in finding all the car names with acceleration 0
to 62
mph of 10
seconds or less.
You can write this query as follows:
// read query
def output(car) {
exists(row, acc:
my_data:Car(row, car)
and my_data:Acceleration(row, acc)
and acc <= 10.0
)
}
In Rel, relations are sets, so if you list the same car more than once in the dataset — for example, in two different countries — it would appear only once in the output.
Next, you can identify the displacement of all Toyota cars:
// read query
def output(car, disp) {
exists(row:
my_data:Car(row, car)
and my_data:Displacement(row, disp)
and like_match("Toyota\%", car)
)
}
Essentially, each query that involves multiple columns is effectively a join because each CSV column is stored in the database as a separate relation.
Aggregation and Group By
Rel also supports aggregate queries. Here’s an example that displays the number of cars per origin:
// read query
def output[origin] = count[row : my_data:Origin(row, origin)]
You can also find the heaviest car:
// read query
def car_weight(car, weight) {
exists(row:
my_data:Car(row, car)
and my_data:Weight(row, weight)
)
}
def output = argmax[car_weight]
Point-Wise and Point-Free Syntax
You can use Rel to express your queries and data models in two different styles.
Point-Free Syntax
The first style is called point-free syntax. This style closely resembles the approach of popular data science languages, such as Python. Here, the variables are implied, i.e., you can omit the variables as long as there is no ambiguity in the model.
Here is an example of a point-free query that returns the brands of all cars:
def output = my_data:Car
Point-Wise Syntax
The second style is point-wise syntax, which is more similar to logical programming. In this case, all the variables are explicitly stated. Here’s an example of a point-wise query, also returning the brands of all cars:
def output(row, car) = my_data:Car(row, car)
Explicitly stating the variables, like in the the point-wise syntax, is often required when performing joins on a shared variable or adding additional conditions to a variable — for instance, see Filtering Within One Column. For more details on querying data, see My First Rel Program, Rel Primer: Basic Syntax, and Rel Primer: Advanced Syntax.
Data Exploration
The imported relation my_data
is a Rel module.
You will now explore the data using simple examples.
Tabular Form
One useful way to take a look at the data is through the lens of the table
relation:
// read query
def output = ::std::display::table[my_data]
You can also select which columns to display in a tabular format.
Here’s an example showing only the values from the Car
and MPG
columns:
// read query
def output = ::std::display::table[my_data[col] for col in {:Car; :MPG}]
This is the familiar tabular (or unstacked) format of the data.
Essentially, table
displays the GNF relation my_data
as a (wide table](/rel/concepts/graph-normal-form#wide-versus-tall-tables).
Exploring the Data
You can start by examining some small snippets of the data. The following code displays the top five rows of the data:
// read query
@ondemand @outline
def top_rows[k, R](col, row, val) {
exists(order:
R(col, row, val)
and sort[second[R]](order, row)
and order <= k
)
}
def output = ::std::display::table[top_rows[5, my_data]]
Note that you only see the top four data rows as one of the top data rows is a loading error that you will explore in Loading Errors.
Similarly, the following code displays the bottom five rows of the data:
// read query
@ondemand @outline
def bottom_rows[k, R](col, row, val) {
exists(order:
R(col, row, val)
and reverse_sort[second[R]](order, row)
and order <= k
)
}
def output = ::std::display::table[bottom_rows[5, my_data]]
Considering the triples’ form, the first row of the data indicates that the value of the Acceleration
column in row 3 of the imported CSV file is 12.0
.
Similarly, the Weight
column in row 408 of the imported CSV file has the value 2720.0
.
Finding Outliers
Rel allows you to explore specific columns of the data.
For example, sort
— and its counterpart reverse_sort
— are higher-order relations that sort data.
Combining sort
with top
and bottom
allows for a simple data exploration that can help identify outliers.
Here are, for example, the bottom and top ten values of the Horsepower
column in the my_data
relation:
// read query
def hp = sort[hpval, row : my_data:Horsepower(row, hpval)]
def output(hpval, row) { bottom[10, hp](_, _, hpval, row) }
// read query
def hp = sort[hpval, row : my_data:Horsepower(row, hpval)]
def output(hpval, row) { top[10, hp](_, _, hpval, row) }
As you can see, there are six rows that have a value of 0
in the Horsepower
column.
Data Cleaning
Loading Errors
In some cases, you may get loading errors when loading CSV data. For more details, see Import Errors in the CSV Import guide. Note that in most cases these errors refer to malformed data.
You can check for loading errors through the :load_errors
field of the loaded relation my_data
.
More specifically:
// read query
def output = my_data:load_errors
In the loaded cars.csv
file, the second row contains the data types of the data.
This row, however, is not useful as you have already specified the data type of each column when importing the data.
You can remove such error lines as follows:
// model
def my_data_no_errors(col, row, entry...) {
my_data(col, row, entry...) and not my_data:load_errors(row, _, _)
}
// read query
def output = my_data_no_errors:load_errors
Note that now my_data_no_errors
does not contain the load_errors
that were present in my_data
.
You can also check the header_row
(opens in a new tab) option that provides an alternative way for removing unnecessary rows from the beginning of a CSV file.
Unwanted/Incorrect Values
You can consider that the 0
values in the Horsepower
column of the CSV data are unwanted and should be removed.
You can perform data cleaning in a way similar to fixing the loading errors.
More specifically, the following code identifies the rows that have a 0
in the Horsepower
column and removes those rows from the data:
// model
def rows_to_remove(row) {
exists(
hp : my_data_no_errors:Horsepower(row, hp)
and hp = 0.0
)
}
def my_data_fix_hp(col, row, entry...) {
my_data_no_errors(col, row, entry...)
and not rows_to_remove(row)
}
// read query
def hp = sort[hpval, row : my_data_fix_hp:Horsepower(row, hpval)]
def output(hpval, row) { top[10, hp](_, _, hpval, row) }
The rows with horsepower equal to 0
have now been removed.
Note that there is a similar situation (i.e., rows with a 0
value) in the MPG column as well.
You can keep those for now to use them later in Imputing Data.
Parameterized Data Cleaning
Data cleaning and preparation are often a large part of the workflow when working with CSV data.
Rel allows you to group the cleaning operations and parameterize them using modules.
Here’s an example that uses parameterized modules and performs the same two operations, i.e., removing loading errors and rows with 0
in the Horsepower
column:
// model
@ondemand @outline
def clean[D, C](col, row, v) {
D(col, row, v)
and not C:remove_cols(col)
and not C:remove_rows:rows_to_remove(row)
}
@inline
module clean_ops[D]
module remove_cols
def load_errors = true
end
module remove_rows
def rows_to_remove(row) {
exists(
hp : D:Horsepower(row, hp)
and hp = 0.0
)
}
end
end
def my_data_clean = clean[my_data, clean_ops[my_data]]
// read query
def output = ::std::display::table[my_data_clean]
Data Manipulation
Views
You started this guide by importing data from the cloud in the my_data
relation.
In the case of my_data
, the data are actually inserted into the database since you used the control relation insert
when importing the data.
A relation whose tuples are inserted into the database, such as my_data
in this example, is called a base relation.
After creating the my_data
relation, you used different operations to remove loading errors or clean some unwanted values from the data.
One example of a relation that does this is the my_data_clean
relation.
This relation, however, operates as a view over the data, i.e., it does not actually store the changed data on disk.
Instead, the RKGS remembers the definition of my_data_clean
.
In this case, my_data_clean
is a derived relation.
For more details on the functionality and differences of these two kinds of relations, see Base and Derived Relations in Rel Primer: Basic Syntax.
When you want to remove or change the data from the system, you should do so in the base relation. In this case, all derived relations are automatically updated to reflect the change.
Note that adding or changing data in a derived relation won’t work since the new data will essentially be a union of the new data with the existing data in the derived relation. The logic for changing data in a derived relation should be installed or defined in a query. For base relations, changing data should be a write transaction.
For more details, see Working With Base Relations.
Adding a Column
You can use Rel to compute new columns over the data that may be useful for your analysis.
As an example, the cars.csv
file that you loaded earlier has two columns, Horsepower
and Weight
, but it doesn’t have the power-to-weight ratio (PWR
) column that measures the total power of a vehicle relative to its total weight.
You can compute the power-to-weight ratio using a point-free approach as follows:
// write query
def my_data_clean:PWR[row] = my_data_clean:Horsepower[row] / my_data_clean:Weight[row]
def output = ::std::display::table[my_data_clean]
The output displays the columns in alphabetical order based on their names, which places the PWR
column at the far right of the table.
Alternatively, you can use the point-wise syntax to compute the PWR
column:
def my_data_clean:PWR(row, pwr) {
exists(hp, w:
my_data_clean:Horsepower(row, hp)
and my_data_clean:Weight(row, w)
and pwr = hp / w
)
}
Deleting a Column
You can also delete a column in a similar way.
For example, here’s how you can delete the newly made column PWR
:
// write query
def delete:my_data_clean:PWR = my_data_clean:PWR
def output = ::std::display::table[my_data_clean]
As you can see, the PWR
column has now been removed.
Adding a Row
You can add a row in the data by providing the values needed for all the different columns.
Since the my_data
relation is stored in Graph Normal Form, you also need to specify a value for the row number.
The following code adds a row with the row value of 409
in the base relation my_data
.
This is equivalent to appending a row at the end of the file:
// write query
def row_val = 409
module new_data
def Acceleration = { (row_val, 4.9); }
def Car = { (row_val, "Nissan GTR Skyline R34"); }
def Cylinders = { (row_val, 6); }
def Displacement = { (row_val, 2600.0); }
def Horsepower = { (row_val, 276); }
def MPG = { (row_val, 16.0); }
def Model = { (row_val, 98); }
def Origin = { (row_val, "Japan"); }
def Weight = { (row_val, 439.2); }
end
def insert:my_data = new_data
// read query
def output = ::std::display::table[my_data_clean]
You can see the new row at the very end of the output.
Deleting a Row
Here is an example of how to delete a row:
// write query
def to_delete_row(row) {
exists(car:
my_data:Car(row, car)
and like_match("Nissan GTR\%", car)
)
}
def delete:my_data(col, row, val) {
my_data(col, row, val)
and to_delete_row(row)
}
// read query
def output = ::std::display::table[my_data_clean]
The row from the very end of the file has now been deleted.
Changing Values
In some cases, it’s useful to change the data.
For example, consider the values for the Origin
column in the data:
// read query
def output(origin) { my_data:Origin(_, origin) }
Now, say you’re interested in spelling out the US
acronym so that it’s consistent with the other two origins, i.e., Europe
and Japan
.
You can change one single value by specifying the given row and setting the new value.
Here’s an example that changes the origin in row 3
in the data to United States
:
// write query
def delete:my_data:Origin[3] = my_data:Origin[3]
def insert:my_data:Origin[3] = "United States"
Note that because my_data_clean
is a derived relation, you need to update the base relation my_data
. The changes propagate to my_data_clean
.
For more details, see Working With Base Relations.
// read query
def output = ::std::display::table[my_data_clean[col] for col in {:Car; :Origin;}]
You can now see that the origin value in row 3
has changed to United States
.
In Rel, you can also change multiple values at once.
Here’s an example that changes all the remaining US
values in the Origin
column to United States
:
// write query
def insert:my_data:Origin(row, orig) {
my_data:Origin(row, "US")
and orig = "United States"
}
// write query
def delete:my_data:Origin(row, orig) {
my_data:Origin(row, orig)
and orig = "US"
}
Note that the delete
needs to happen after the insert and thus you need to perform two separate transactions.
The data in the relation my_data_clean
have now been changed accordingly.
The next query displays all different origins in the data:
// read query
def output(origin) = my_data_clean:Origin(_, origin)
Imputing Data
Rel also provides the functionality for performing data imputation.
For this example, consider again the 0
values that were discussed earlier in the MPG
column:
// read query
def output(row, mpgval) {
my_data_clean:MPG(row, mpgval)
and mpgval = 0.0
}
As these values are most likely incorrect, you can use Rel to impute the data.
There are several approaches to this.
One way is to replace all 0
MPG with the overall average MPG value.
Here’s the code that does this:
// write query
// Get non-zero `mpgs`.
def nz_mpgs(row, mpgval) {
my_data_clean:MPG(row, mpgval)
and mpgval != 0.0
}
// Compute the average.
def nz_mpgs_avg = mean[nz_mpgs]
// Impute.
def insert:my_data:MPG(row, val) {
my_data:MPG(row, 0.0)
and val = nz_mpgs_avg
}
// write query
def delete:my_data:MPG(row, val) {
my_data:MPG(row, val)
and val = 0.0
}
// read query
def output(row, mpgval) {
my_data_clean:MPG(row, mpgval)
and mpgval = 0.0
}
There are now no 0
MPG values in the data.
In this section, you imputed certain values, i.e., you replaced certain values with another computed number. You can follow a similar approach for values that are missing. For more details, see Missing Values in the Graph Normal Form concept guide.
Data Visualization
Rel allows you to visualize data through the Vega-Lite (opens in a new tab) library. You can plot different views of your data using the various visualization capabilities.
For example, you can plot a histogram of the number of cars in each Origin
.
The data contain cars from three different origins, namely Europe
, Japan
, and US
.
The Vega-Lite library requires the data in a slightly different form.
First, the data need to be numbered consecutively.
You can do this using the lined_csv
functionality.
Second, it requires the data to be in array format.
Here’s the model to perform these operations:
// model
def my_data_lined = lined_csv[my_data_clean]
def my_data_graph[:[], i, col] = my_data_lined[col, i]
You can now plot a histogram of the number of cars per origin:
// read query
// Assign the data.
def chart:data:values = my_data_graph
// Set up the chart.
def chart:mark:type = "bar"
def chart:mark:tooltip = boolean_true
def chart = vegalite_utils:x[{
(:field, "Origin");
(:title, "Origin");
(:type, "ordinal");
(:axis, {
(:labelAngle, 45);
(:ticks, boolean_true);
(:grid, boolean_true);
})
}]
def chart = vegalite_utils:y[{
(:aggregate, "count");
(:type, "quantitative");
}]
// Display.
def output = ::std::display::vegalite::plot[chart]
Similarly, you can check the correlation between horsepower and acceleration by creating a scatter plot using Vega-Lite. You can additionally color the different points in the plot using different colors per origin. Here’s the code to create this chart:
// read query
// Assign the data.
def chart:data:values = my_data_graph
// Set up the chart.
def chart:mark = "point"
def chart = vegalite_utils:x[{
(:field, "Horsepower");
(:title, "Horsepower (hp)");
(:type, "quantitative");
(:scale, :zero, boolean_false);
}]
def chart = vegalite_utils:y[{
(:field, "Acceleration");
(:title, "Acceleration 0-60mph (sec)");
(:type, "quantitative");
(:scale, :zero, boolean_false);
}]
def chart = vegalite_utils:color[{
(:field, "Origin");
(:type, "nominal");
(:scale, :domain, :[], {(1, "United States"); (2, "Japan"); (3, "Europe")});
(:title, "Origin");
}]
// Display.
def output = ::std::display::vegalite::plot[chart]
Summary
You have learned how to work with CSV data using Rel by loading, exploring, and manipulating data and their schema. Using the examples in this guide, you are now able to import various forms of CSV data into your RAI database and work with the data. See also the CSV Import, CSV Export, and JSON Import guides for more information.