CSV Import

This how-to guide demonstrates how to import data from a CSV file using Rel.

Goal

By following this guide, you will be able to import CSV data into the RKGMS database and learn how to adjust the import configuration to handle a variety of CSV formats and data locations.

Introduction

The Comma-Separated Values (CSV) format is one of the most common data formats, and CSV data is ubiquitous. This how-to guide demonstrates how to import a wide range of CSV-like data with Rel and describes all of the CSV import configuration options.

Simple Import

This guide will begin with a simple example. (For a more detailed one, see the Comprehensive Example below.) Consider the following CSV file with two string columns available in this public S3 location:

country, drink
Spain,Gazpacho
Spain,Sangría
Argentina,"Yerba Mate"
United States,"Coca-Cola"
Italy,San Pellegrino

The following Rel code loads this file into a relation called my_data, and displays it using the table utility:

def my_data = load_csv["s3://relationalai-documentation-public/csv-import/simple-drinks.csv"]

def output = table[my_data]

Relation: output

country  drink
RelationalAITypes.FilePos(15)  "Spain"  "Gazpacho"
RelationalAITypes.FilePos(30)  "Spain"  "Sangría"
RelationalAITypes.FilePos(45)  "Argentina"  "Yerba Mate"
RelationalAITypes.FilePos(68)  "United States"  "Coca-Cola"
RelationalAITypes.FilePos(94)  "Italy"  "San Pellegrino"

The imported CSV data is stored in the my_data relation, which is a Rel module. Each column in the original CSV is assigned to a sub-relation, which is given the same name as the column header from the CSV. For more details, see the CSV Data Model section below. The table utility converts it back to a tabular form for display purposes.

File positions are used as the key that connects row entries. Line numbers could serve the same purpose, but file positions are more convenient when loading large files. The lined_csv utility converts file positions to line numbers if needed.
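As a sketch of that conversion, using the same simple file as above:

```rel
// load the CSV, then re-key the rows by line number instead of file position
def my_data = load_csv["s3://relationalai-documentation-public/csv-import/simple-drinks.csv"]
def my_lined_data = lined_csv[my_data]
```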

By default, all columns are loaded as strings. There is no automatic type conversion, so ...,123,... becomes the string "123".

Defining Import Options

To specify data types for each column, include them in a config relation, which is a module that defines the import configuration.

Consider this data:

cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"

config:schema specifies data types for each column:

// import configuration
module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"

def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end

// import csv data as EDB
def insert:my_data = load_csv[config]

// display imported data
def output = table[my_data]

Relation: output

cocktail  date  price  quantity
RelationalAITypes.FilePos(29)  "martini"  2020-01-01  12.50  2
RelationalAITypes.FilePos(60)  "sazerac"  2020-02-02  14.25  4
RelationalAITypes.FilePos(91)  "cosmopolitan"  2020-03-03  11.00  4
RelationalAITypes.FilePos(127)  "bellini"  2020-04-04  12.25  3

The statement def insert:my_data = load_csv[config] saves the imported data in the EDB relation my_data. Note that the import is done within an Update transaction (a non-read-only query), and the config definition is only present for the duration of the query. For more details on working with EDB data, see the Concept Guide Updating Data: Working with EDB Relations.

Querying CSV Data

Now you can query the imported CSV data by joining the columns of interest by their common key—the file position:

def output(c, d) =
my_data:cocktail(pos, c) and my_data:date(pos, d)
from pos

Relation: output

"bellini"  2020-04-04
"cosmopolitan"  2020-03-03
"martini"  2020-01-01
"sazerac"  2020-02-02

The section CSV Data Model describes in more detail how the CSV data is stored in our RKGMS.

Import Options

For very simple situations, where no configuration needs to be specified, just provide the file path:

def data = load_csv["s3://...."]

For more customized imports, load_csv takes a configuration relation, config, as an argument.

module config
// ...
end
def data = load_csv[config]

For clarity, calling it config is recommended, but you can give it any name. The best practice is to make it a transient definition in an update query, so it will not affect other loads; installing it is discouraged. You can, however, install template configurations (with a different name) and then instantiate them to derive more specific configurations — see Reusing Configurations.
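As a sketch of the template pattern (the template module name here is hypothetical, not part of any library):

```rel
// hypothetical installed template holding shared syntax options
module csv_pipe_template
def syntax:delim = '|'
def syntax:header_row = 1
end

// transient, load-specific configuration that instantiates the template
module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-1col-pipedelim.csv"
def syntax = csv_pipe_template:syntax
end

def data = load_csv[config]
```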

Within the configuration module config, you can specify:

  • path: a URL for the CSV data.
  • integration: credentials needed to access the data.
  • data: an alternative way to specify the data itself.
  • schema: the schema of the CSV data.
  • syntax: the syntax of the CSV data.

Path

The import option path is a string that specifies the location and name of the file you want to import. Currently, this string can point to s3://... (Amazon S3) or azure://... (Microsoft Azure) URLs.

Example:

module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"
end

Integration

You may need to specify credentials for accessing your data in cloud storage, using the import option integration. This is necessary when the cloud storage is not publicly accessible.

  • integration:provider: the cloud storage provider. Currently only Azure ("azure") is supported.
  • integration:credentials: the cloud storage access credentials, given as a tuple with the credential type as a RelName and the token as a String (e.g., (:azure_sas_token, "sv=...") for Azure).

Example:

module config
def path = "azure://myaccount.blob.core.windows.net/sascontainer/myfile.txt"

module integration
def provider = "azure"
def credentials = (:azure_sas_token, "sv=2014-02-14&sr=b&si=TestPolicy&sig=o%2B5%2F0C%2BLm7tWWftNKvQEGKHlSt%2Bfs8No7FZkUk5T%2Bv0%3D")
end
end

Data

The import option data is a string containing the actual data to be imported. If path is also present, the data field is ignored and only path is used for importing the data.

Example:

module config
def data = "a,b,c\n1,2,3\n4,5,6"
end

You can also use the multi-line string functionality to write CSV data with several rows.
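For example, a small sketch using a triple-quoted multi-line string:

```rel
module config
def data = """
a,b,c
1,2,3
4,5,6
"""
end

def my_data = load_csv[config]
```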

Schema

The import option schema is a binary relation indicating the file schema. The pairs are of the form (RelName, String): the column name is specified as a RelName and the data type as a String (e.g., (:a, "int")).

The supported data types are "int", "string", "float", "decimal(n, digits)", "date", and "datetime". For decimals, n is the number of bits and digits is the number of decimal digits, which should correspond to a valid Fixed Decimal Rel type. By default, everything is loaded as a string. Note that the type definitions are case-sensitive.

Example:

module config
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end

Syntax

The import option syntax is a module with sub-fields and is used to specify the CSV file format. The module syntax has the following sub-fields as options:

  • syntax:header: a relation with (Int, RelName) pairs, specifying the symbols to be used as column names, where the Int indexes indicate the column order. This option overrides the default (or existing) header names.
  • syntax:header_row: an Int that specifies the input row containing the header names, if they are not defined through the header option above. Values smaller than 1 indicate that no header is present; in that case, Rel creates column names for you (see Column Names). If no header_row value is specified, a default of 1 is used (i.e., the first line of the file).
  • syntax:delim: the Char delimiter used in the file to separate the columns. The default delimiter is a comma (,).
  • syntax:quotechar: a Char that indicates a quoted text field. A quoted text field is required to include special characters, like the delimiter or a newline character, in a text value. The default quote character is ".
  • syntax:escapechar: the Char used to escape the quote character within a quoted field. The default value is a backslash (\).

Example:

module config
module syntax
def delim = '|'
def header_row = -1
end
end

CSV Data Model

In Rel, data is best stored in Graph Normal Form (GNF), a fully normalized representation that takes advantage of Rel features such as modularity and specialization.

CSV data imported via load_csv follows basic GNF conventions. Concretely, the CSV data is represented as triples of the form (colname, pos, value):

def data(colname, pos, v) = load_csv[config](colname, pos, v)

where

  • colname refers to the CSV column names and is of type RelName.
  • pos is the record identifier and has the type FilePos. It uniquely identifies the file row.
  • value is the actual data.

By storing the CSV data in this format, the relation data becomes a Rel module. To understand how modules work, see the Modules concept guide.

Using the file position instead of the row number as a record identifier allows parallelized loading of CSV data. See section Replacing FilePos with Row Numbers if you want to use the row number instead of the FilePos as the record identifier.

The Rel code that defines data above is written in the point-wise form, where all variables are explicitly stated. A more compact way of writing the above query is by using the point-free style, where the variables are implied:

def data = load_csv[config]

Once imported, the raw CSV data will usually be schema-mapped and transformed further, according to the desired data model. GNF is recommended for cleaner modeling and efficient reasoning.
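As an illustrative sketch, the typed cocktail data loaded earlier could be re-keyed by cocktail name into model-level relations (the relation names here are made up for illustration):

```rel
// map the raw, FilePos-keyed CSV module into model relations keyed by cocktail
def order_quantity(c, q) =
my_data:cocktail(pos, c) and my_data:quantity(pos, q)
from pos

def order_price(c, p) =
my_data:cocktail(pos, c) and my_data:price(pos, p)
from pos
```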

Specifying Data Sources

Cloud

You can import CSV files from the cloud. If the cloud storage location is publicly available (e.g.: this file), you only need to provide the path, in the configuration option path.

module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"
end

Currently, Amazon S3 ("s3://") and Azure Blob Storage ("azure://") are supported.

Note: Currently, S3 data must be in buckets in the us-east-1 (N. Virginia) region.

Cloud Credentials

In order to access private data, you need to specify cloud credentials in the import option integration.

Currently, only private Azure Blob Storage is supported. To access it, you must provide a valid SAS token and URL address, corresponding to a valid Azure account.

The URL is provided via the path import option. Within the integration submodule, the following information needs to be provided:

  • the cloud storage provider in the provider field, and
  • the access token as a string in the credentials field, along with the token name (:azure_sas_token).

module config
def path = "azure://myaccount.blob.core.windows.net/sascontainer/myfile.txt"

module integration
def provider = "azure"
def credentials = (:azure_sas_token, "sv=2014-02-14&sr=b&si=TestPolicy&sig=o%2B5%2F0C%2BLm7tWWftNKvQEGKHlSt%2Bfs8No7FZkUk5T%2Bv0%3D")
end
end

def data = load_csv[config]

Local Files

To import local CSV files, you need to use the RAI Console. In the Console, you can either:

  • Upload data directly to the database using the “Data” section in the object tree for the given DB.
  • Upload data using the RAI notebook file upload mechanism.

In both cases, the uploaded file is stored as a string in an EDB relation, say, my_uploaded_file. To process the file as CSV data, use the load_csv functionality with config:data pointing to that EDB relation:

def config:data(text) = my_uploaded_file(text) and String(text)
def my_data = load_csv[config]

The condition String(text) ensures that only string values get assigned to config:data. See the next section, Strings, for details on how to specify CSV data directly as a string.

Strings

CSV data can be specified directly as a string constant. This is handy for small data sets and tests.

Here is a small example:

def config:data = """
a|b|c
1|2|3
4|5|6
"""
def config:syntax:delim = '|'

def output = load_csv[config]

Relation: output

:a  RelationalAITypes.FilePos(6)  "1"
:a  RelationalAITypes.FilePos(12)  "4"
:b  RelationalAITypes.FilePos(6)  "2"
:b  RelationalAITypes.FilePos(12)  "5"
:c  RelationalAITypes.FilePos(6)  "3"
:c  RelationalAITypes.FilePos(12)  "6"

Compression (gzip)

You can load cloud files compressed with gzip. For gzip import to work, the files must have the .gz file extension.

Currently, files imported with the local file upload mechanism must be uncompressed.
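For example (the bucket and file name here are hypothetical):

```rel
module config
// the .gz extension tells the system to decompress the file with gzip
def path = "s3://mybucket/monthly-sales.csv.gz"
end

def data = load_csv[config]
```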

Specifying Schema Options

Column Data Types

Consider the following data file with four columns, each holding a different data type: string, integer, decimal, and date.

cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"

file: simple-import-4cols.csv

Using the schema option, you can import this file and specify the data type of each of the imported columns:

module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"

def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail  RelationalAITypes.FilePos(29)  "martini"
:cocktail  RelationalAITypes.FilePos(60)  "sazerac"
:cocktail  RelationalAITypes.FilePos(91)  "cosmopolitan"
:cocktail  RelationalAITypes.FilePos(127)  "bellini"
:date  RelationalAITypes.FilePos(29)  2020-01-01
:date  RelationalAITypes.FilePos(60)  2020-02-02
:date  RelationalAITypes.FilePos(91)  2020-03-03
:date  RelationalAITypes.FilePos(127)  2020-04-04
:price  RelationalAITypes.FilePos(29)  12.50
:price  RelationalAITypes.FilePos(60)  14.25
:price  RelationalAITypes.FilePos(91)  11.00
:price  RelationalAITypes.FilePos(127)  12.25
:quantity  RelationalAITypes.FilePos(29)  2
:quantity  RelationalAITypes.FilePos(60)  4
:quantity  RelationalAITypes.FilePos(91)  4
:quantity  RelationalAITypes.FilePos(127)  3

This confirms that the data in each column has the data type that was specified above.

Note that quotes around individual values are stripped. Quote characters within a quoted field are escaped with a backslash by default (\"). Any spaces before or after a comma are preserved (as is standard with CSVs), with the exception of the header row.
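For instance, in this inline sketch the comma inside the quoted field is kept as data rather than treated as a delimiter:

```rel
module config
def data = """
name,descr
"martini","gin, vermouth"
"""
end

def csv = load_csv[config]
```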

Column Names

The previous examples imported the file using the default column names, as defined in the first row of the file (this row number can be changed with the header_row parameter). Rel also lets you specify any column names you would like. Once you decide to specify the header, you must specify all the column names; you cannot skip any of them.

The next example loads the same data, but changes the imported column names: it renames cocktail to beverage, quantity to amount, and price to value. (The name date is not changed). This is done using the syntax and header parameters within the config relation. The header parameter takes as input tuples of the form (col_pos, col_name), which indicate how to rename each of the columns. col_pos refers to the column position as it appears in the imported file.

module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"

def syntax:header = {
(1, :beverage); (2, :amount);
(3, :value); (4, :date)
}

def schema = {
(:beverage, "string");
(:amount, "int");
(:value, "decimal(64,2)");
(:date, "date");
}
end

def csv = load_csv[config]
def output = csv

Relation: output

:amount  RelationalAITypes.FilePos(29)  2
:amount  RelationalAITypes.FilePos(60)  4
:amount  RelationalAITypes.FilePos(91)  4
:amount  RelationalAITypes.FilePos(127)  3
:beverage  RelationalAITypes.FilePos(29)  "martini"
:beverage  RelationalAITypes.FilePos(60)  "sazerac"
:beverage  RelationalAITypes.FilePos(91)  "cosmopolitan"
:beverage  RelationalAITypes.FilePos(127)  "bellini"
:date  RelationalAITypes.FilePos(29)  2020-01-01
:date  RelationalAITypes.FilePos(60)  2020-02-02
:date  RelationalAITypes.FilePos(91)  2020-03-03
:date  RelationalAITypes.FilePos(127)  2020-04-04
:value  RelationalAITypes.FilePos(29)  12.50
:value  RelationalAITypes.FilePos(60)  14.25
:value  RelationalAITypes.FilePos(91)  11.00
:value  RelationalAITypes.FilePos(127)  12.25

To indicate that there is no header in the file, set the parameter syntax:header_row to 0 or any negative number.

module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"

def syntax:header_row = -1
end

def csv = load_csv[config]
def output = csv

Relation: output

:COL1  RelationalAITypes.FilePos(0)  "cocktail"
:COL1  RelationalAITypes.FilePos(29)  "martini"
:COL1  RelationalAITypes.FilePos(60)  "sazerac"
:COL1  RelationalAITypes.FilePos(91)  "cosmopolitan"
:COL1  RelationalAITypes.FilePos(127)  "bellini"
:COL2  RelationalAITypes.FilePos(0)  "quantity"
:COL2  RelationalAITypes.FilePos(29)  "2"
:COL2  RelationalAITypes.FilePos(60)  "4"
:COL2  RelationalAITypes.FilePos(91)  "4"
:COL2  RelationalAITypes.FilePos(127)  "3"
:COL3  RelationalAITypes.FilePos(0)  "price"
:COL3  RelationalAITypes.FilePos(29)  "12.50"
:COL3  RelationalAITypes.FilePos(60)  "14.25"
:COL3  RelationalAITypes.FilePos(91)  "11.00"
:COL3  RelationalAITypes.FilePos(127)  "12.25"
:COL4  RelationalAITypes.FilePos(0)  "date"
:COL4  RelationalAITypes.FilePos(29)  "2020-01-01"
:COL4  RelationalAITypes.FilePos(60)  "2020-02-02"
:COL4  RelationalAITypes.FilePos(91)  "2020-03-03"
:COL4  RelationalAITypes.FilePos(127)  "2020-04-04"

In this case, the system assigned default COLX names to the columns. Additionally, it considered the first row (which actually contains the column headers) as actual data and imported that data as well.

Since no types were specified, all data was read in as strings. You can specify the schema using the COLX attributes. For example, for the simple file:

"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"

file: simple-import-4cols-noheader.csv

you can specify the types of the columns as follows:

module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-4cols-noheader.csv"

def syntax:header_row = -1

def schema = {
(:COL1, "string");
(:COL2, "int");
(:COL3, "decimal(64,2)");
(:COL4, "date");
}
end

def csv = load_csv[config]
def output = csv

Relation: output

:COL1  RelationalAITypes.FilePos(0)  "martini"
:COL1  RelationalAITypes.FilePos(31)  "sazerac"
:COL1  RelationalAITypes.FilePos(62)  "cosmopolitan"
:COL1  RelationalAITypes.FilePos(98)  "bellini"
:COL2  RelationalAITypes.FilePos(0)  2
:COL2  RelationalAITypes.FilePos(31)  4
:COL2  RelationalAITypes.FilePos(62)  4
:COL2  RelationalAITypes.FilePos(98)  3
:COL3  RelationalAITypes.FilePos(0)  12.50
:COL3  RelationalAITypes.FilePos(31)  14.25
:COL3  RelationalAITypes.FilePos(62)  11.00
:COL3  RelationalAITypes.FilePos(98)  12.25
:COL4  RelationalAITypes.FilePos(0)  2020-01-01
:COL4  RelationalAITypes.FilePos(31)  2020-02-02
:COL4  RelationalAITypes.FilePos(62)  2020-03-03
:COL4  RelationalAITypes.FilePos(98)  2020-04-04

The data is now imported with the correct types: string, int, decimal, and date.

Specifying Syntax Options

Defining a Header Row

Going back to the data with the headers, you can also use the second row to define the column names.

Specifying header_row=2 tells the system where to start reading within the file. The first row of cocktail,quantity,price,date is ignored. The system starts at line 2, considers that line as a header, and then loads the rest of the data.

In general, all rows above the header_row are ignored. This option is useful when a number of lines at the beginning of the file need to be ignored. For example, if there are comments before the actual header and data that describe what the file is about. By specifying the right header_row, you can load the header and subsequent data without having to edit the file to remove any comments at the top.
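For instance, here is a sketch with a descriptive comment line above the header (the data is inlined as a string for self-containment):

```rel
module config
def data = """
monthly cocktail prices, exported 2020
cocktail,price
"sazerac",30
"martini",24
"""
// line 1 is ignored; line 2 is read as the header
def syntax:header_row = 2
end

def csv = load_csv[config]
```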

For the running example, this generates clearly incorrect results, since row 2 is not, in fact, a header row:

module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"

def syntax:header_row = 2
end

def csv = load_csv[config]
def output = csv

Relation: output

:12.50  RelationalAITypes.FilePos(60)  "14.25"
:12.50  RelationalAITypes.FilePos(91)  "11.00"
:12.50  RelationalAITypes.FilePos(127)  "12.25"
:2  RelationalAITypes.FilePos(60)  "4"
:2  RelationalAITypes.FilePos(91)  "4"
:2  RelationalAITypes.FilePos(127)  "3"
:2020-01-01  RelationalAITypes.FilePos(60)  "2020-02-02"
:2020-01-01  RelationalAITypes.FilePos(91)  "2020-03-03"
:2020-01-01  RelationalAITypes.FilePos(127)  "2020-04-04"
:martini  RelationalAITypes.FilePos(60)  "sazerac"
:martini  RelationalAITypes.FilePos(91)  "cosmopolitan"
:martini  RelationalAITypes.FilePos(127)  "bellini"

Modifying Delimiters

You can change the delimiters used to read the data. Consider the following file, which uses | as a delimiter:

cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20

file: simple-import-1col-pipedelim.csv

You can read such data by specifying the delim parameter in the config relation:

module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-1col-pipedelim.csv"

def schema = {
(:cocktail, "string");
(:price, "int");
}

def syntax:delim = '|'
end

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail  RelationalAITypes.FilePos(15)  "sazerac"
:cocktail  RelationalAITypes.FilePos(28)  "martini"
:cocktail  RelationalAITypes.FilePos(41)  "cosmopolitan"
:price  RelationalAITypes.FilePos(15)  30
:price  RelationalAITypes.FilePos(28)  24
:price  RelationalAITypes.FilePos(41)  20

Loading TSV Files

For the common case of loading a tab-separated values (TSV) file, set the delimiter to \t:

def config:syntax:delim = '\t'

The other config options will work in the same way as discussed in this document for CSVs.

Comprehensive Example

Here is a more comprehensive import example that can serve as a copy-and-paste template. It touches on multiple CSV loading aspects that come up when working with CSV data.

def mydata = """
first,last,age,plays_for
Leonel,Messi,32,PSG
Cristiano,Ronaldo,36,"Manchester United"
"""

// import configuration
module config
def data = mydata

module schema
def first = "string"
def last = "string"
def age = "int"
def plays_for = "string"
end

module syntax
def header_row = 1
def delim = ','

// to change the header names:
// def header = {(1, :firstname); (2, :lastname); (3, :age); (4, :team)}
end
end

// intermediate IDB relation
def csv = load_csv[config]

// persist loaded CSV data in the database
def delete[:csv_edb] = csv_edb // clear potential previous contents of EDB data
def insert[:csv_edb] = csv

// display the persisted data
def output = table[csv_edb]

Relation: output

age  first  last  plays_for
RelationalAITypes.FilePos(25)  32  "Leonel"  "Messi"  "PSG"
RelationalAITypes.FilePos(45)  36  "Cristiano"  "Ronaldo"  "Manchester United"

The CSV data is included as a string constant to make this example as self-contained as possible. To load CSV data from the cloud, please use the config:path and config:integration options as described in the Cloud section.

The Rel code above is executed as an Update transaction, which is needed to insert the loaded CSV data into the database and persist it in the EDB relation csv_edb. The command def delete[:csv_edb] = csv_edb ensures that the new CSV data is inserted into a blank EDB relation and not appended to an existing one. If this behavior is not wanted, you can remove this line.

The intermediate IDB relation csv is not persisted in the database. It is, however, still useful to first assign the loaded CSV data to an intermediate IDB relation. This is especially helpful when data cleaning or filtering steps should be performed before persisting the loaded CSV data in the database.

Advanced Import Scenarios

Replacing FilePos with Row Numbers

The examples so far have used the file position as the key for each row. Rel provides another option: using the row number as the key. To do this, pass the load_csv output relation to the lined_csv utility, as follows:

def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64, 2)"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def rows_csv = lined_csv[csv]

Relation: rows_csv

:cocktail  1  "martini"
:cocktail  2  "sazerac"
:cocktail  3  "cosmopolitan"
:cocktail  4  "bellini"
:date  1  2020-01-01
:date  2  2020-02-02
:date  3  2020-03-03
:date  4  2020-04-04
:price  1  12.50
:price  2  14.25
:price  3  11.00
:price  4  12.25
:quantity  1  2
:quantity  2  4
:quantity  3  4
:quantity  4  3

The result now has successive row IDs as keys, instead of file positions. Note that lined_csv only counts lines that contain data, including errors, while headers and empty lines are not counted.

Using a Column as the Primary Key

The previous examples used the file offset as the key to identify each row. This key is known as the primary key. Instead of using the file offset, you can specify another data column as the primary key. For example, consider the following data, where the first column id is meant to be the primary key:

id,cocktail
11,"martini"
22,"sazerac"
33,"cosmopolitan"
44,"bellini"

file: simple-import-2cols-with-key.csv

To use the id column as the key, you can proceed as follows: First, read in the data in a similar way to the examples so far. Then, convert the csv relation to another relation, cocktails, which replaces the FilePos row key with the values in the id column:

module config
def path = "s3://relationalai-documentation-public/csv-import/simple-import-2cols-with-key.csv"

def schema = {
(:id, "int");
(:cocktail, "string");
}
end

def csv = load_csv[config]

def cocktails(:cocktail, id, cocktail) =
csv(:id, pos, id) and csv(:cocktail, pos, cocktail)
from pos

Relation: cocktails

:cocktail  11  "martini"
:cocktail  22  "sazerac"
:cocktail  33  "cosmopolitan"
:cocktail  44  "bellini"

Please note that the key used should be unique to each row. It is good practice to ensure that this key is unique using an integrity constraint.
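One way to sketch such a constraint is with the standard library's function predicate, which requires that each key determines at most one value (a sketch based on the cocktails relation above):

```rel
// violated if some id maps to more than one cocktail name
ic cocktails_key_unique {
function(cocktails:cocktail)
}
```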

In general, the loaded CSV can be transformed in multiple ways before it is used by the rest of the model or saved as an EDB. This corresponds to an ELT (extract-load-transform) approach, and can be more flexible than the traditional ETL (extract-transform-load), which transforms before the load.

Loading Multiple Files in the Same Relation

Certain cases call for loading data from several different files into the same relation. For example, you could have two files that correspond to sales over the period of two different months, or a dataset that was simply split into several files (as discussed in more detail in the following section). Here are two such example files showing cocktails and units sold in the months of March and April:

cocktail,units
martini,12
sazerac,23

file: cocktail-units-march.csv

cocktail,units
martini,12
bellini,32

file: cocktail-units-april.csv

To load the data in this case, you need to use load_csv twice, once for each file. To distinguish data by file, you can use the month name as an additional key for the import configuration module config and for the imported data csv:

def my_schema = {
(:cocktail, "string");
(:units, "int");
}

module config["march"]
def path = "s3://relationalai-documentation-public/csv-import/cocktail-units-march.csv"
def schema = my_schema
end

module config["april"]
def path = "s3://relationalai-documentation-public/csv-import/cocktail-units-april.csv"
def schema = my_schema
end

def csv[month] = load_csv[config[month]]

def output = csv

Relation: output

"april"  :cocktail  RelationalAITypes.FilePos(15)  "martini"
"april"  :cocktail  RelationalAITypes.FilePos(26)  "bellini"
"april"  :units  RelationalAITypes.FilePos(15)  12
"april"  :units  RelationalAITypes.FilePos(26)  32
"march"  :cocktail  RelationalAITypes.FilePos(15)  "martini"
"march"  :cocktail  RelationalAITypes.FilePos(26)  "sazerac"
"march"  :units  RelationalAITypes.FilePos(15)  12
"march"  :units  RelationalAITypes.FilePos(26)  23

Examining the output, observe that there is an additional string, either "march" or "april", identifying each dataset. Note that if only file positions are used as keys when loading from different data files, the file positions might overlap, leading to unintentional data loss.

To make this more specific, if you dropped the month as a key in csv and wrote:

def csv = load_csv[config[_]]

then you would obtain the tuple martini,12 only once in the loaded data. This is for two reasons: First, the tuple is the same in both files. Second, it happens that the tuples appear in the exact same file position (specifically, FilePos(15)). As a result, (:cocktail, FilePos(15), "martini") will be generated by load_csv for both files. Because you are performing a union over both loaded datasets, duplicate tuples will be dropped, and (:cocktail, FilePos(15), "martini") will appear only once in csv.

Even if the tuples were different, it would become impossible to reconstruct the two rows after the load, as you would be unable to tell which attribute value came from which file.

Multi-Part File Loads

In cases where a large data set is split up into different parts, you can use a file ID to distinguish them, and include it as part of the key, along with the file position.

The following example loads five files, country_part1.tsv.gz through country_part5.tsv.gz, into a single csv relation. The config relation is parameterized by the index i, which is also included in the csv relation:

def pattern = "s3://relationalai-documentation-public/csv-import/countries/country_partNNN.tsv.gz"
def partid = range[1,5,1]
def files = string_replace[pattern, "NNN", string[x]] for x in partid

module config[i in partid]
def path = files[i]
def syntax:delim = '\t'
def syntax:header_row = -1
def syntax:header = (1, :ccode) ; (2, :cname)
def schema = (:ccode, "string") ; (:cname, "string")
end

def csv(col, i, pos, v) = load_csv[config[i]](col, pos, v)

def output:num_codes = count[code: csv:ccode(i, pos, code) from i, pos]
def output:num_pos = count[pos : csv:ccode(i, pos, code) from i, code]
def output:names(n) = csv:cname(_, _, n)
and (like_match("Ar%", n) or like_match("Z%", n) )

Relation: output

:names  "Argentina"
:names  "Armenia"
:names  "Aruba"
:names  "Zambia"
:names  "Zimbabwe"
:num_codes  250
:num_pos  209

Note that while this code loads data for 250 countries, the number of file positions is smaller, since some rows start at the same file position in more than one input file. Including i in the csv relation makes sure that those rows are distinguished from each other.

Import Errors

Sometimes import errors occur that were not anticipated or cannot be handled. In either case, load_csv supports error handling to help you understand what may have gone wrong.

When the CSV input file cannot be reliably read (e.g., it is formatted incorrectly), Rel attempts to parse as many rows as possible. Any rows that it was not able to reliably parse are stored in the :load_errors sub-relation. This sub-relation can be accessed the same way as the rest of the sub-relations in the imported relation (e.g. csv:load_errors in the previous examples).

The :load_errors relation has the format (:load_errors, file_position, line, explanation). Note that it has arity 4, instead of the normal arity 3 of successfully imported column relations. Examples are shown below.

Formatting Errors

Here is an example file that contains two formatting errors, namely, an extra field in each of the first two data lines:

cocktail,price
"sazerac",30,"extra"
"martini",24,"another_extra"
"cosmopolitan",20

file: import-errors-extracol.csv

In this example file, the first line is a header indicating that there should be two columns in the file. The second and third lines, however, have one extra field, which is an error. Finally, the last line is correct.

You can now attempt to import this data. Consider that the first column should be a string, while the second column should be an integer, and use the following code:

def config[:path] = "s3://relationalai-documentation-public/csv-import/import-errors-extracol.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktailRelationalAITypes.FilePos(65)"cosmopolitan"
:load_errorsRelationalAITypes.FilePos(15)2""sazerac",30,"extra""
:load_errorsRelationalAITypes.FilePos(36)2""martini",24,"another_extra""
:priceRelationalAITypes.FilePos(65)20

This output specifies two load_errors and displays each erroneous line as the last field. The number right after the file position indicates the 1-based index of the field where Rel believes the error to be, or 0 if the whole line could not be parsed at all. In this example, the index is 2, since the unexpected extra field appears right after the second field.

Ensuring No Import Errors

One easy way to guard against loading errors is to use an integrity constraint.

Such a constraint can simply check that the resulting relation (csv in these examples) contains no :load_errors facts. If any are present, the load did not complete successfully and the constraint is violated.

Here is the same example as before with the addition of an integrity constraint, no_load_errors, which is violated when attempting to load problematic data:

def config[:path] = "s3://relationalai-documentation-public/csv-import/import-errors-extracol.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def csv = load_csv[config]

ic no_load_errors { empty(csv:load_errors) }

You can add similar constraints to enforce other data validation requirements.
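For example, a hypothetical constraint requiring every imported price to be positive could be sketched as follows (the name positive_prices is illustrative, and csv is the relation from the example above):

ic positive_prices(pos, p) {
    csv(:price, pos, p) implies p > 0
}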

Error in the Data Type

In the following example, there is no real error in the data, but the configuration specifies incorrect data types for each column: int for the cocktail name and string for the price. Here is the data:

cocktail,price
"sazerac",30
"martini",24
"cosmopolitan",20

file: import-errors-wrongtype.csv

Try to load the data as follows (note the incorrect data type definitions):

def config[:path] = "s3://relationalai-documentation-public/csv-import/import-errors-wrongtype.csv"
def config[:schema, :cocktail] = "int"
def config[:schema, :price] = "string"

def csv = load_csv[config]
def output = csv

Relation: output

:load_errorsRelationalAITypes.FilePos(15)1""sazerac",30"
:load_errorsRelationalAITypes.FilePos(28)1""martini",24"
:load_errorsRelationalAITypes.FilePos(41)1""cosmopolitan",20"
:priceRelationalAITypes.FilePos(15)"30"
:priceRelationalAITypes.FilePos(28)"24"
:priceRelationalAITypes.FilePos(41)"20"

The system reports three errors and identifies the affected lines. The index 1 after each file position shows that the problem lies in the first field: the cocktail names could not be converted to int, so those lines could not be correctly parsed.

At the same time, Rel follows a best-effort approach and parses as much data as possible. Although it was not able to convert the cocktail strings to int, it was able to read in the integers of the price column as string. As a result, it imported that data, which appears as the last three lines of the output above.

In summary, keep in mind that, even if a line of the file may contain an error, Rel may still parse and import part of that line.
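When debugging such partial imports, it can be easier to work with line numbers than with file positions. A sketch using the lined_csv utility mentioned earlier, assuming the csv relation from the example above:

def output = lined_csv[csv]

This presents the same sub-relations, keyed by line number instead of file position.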

Empty Lines

Another example of this behavior occurs when a file contains empty lines. Rel ignores them completely and continues parsing at the next line that contains data. For example, consider the following data:

cocktail,price
"sazerac",30


"martini",24


"cosmopolitan",20


file: import-errors-emptylines.csv

You can read this data using the following code:

def config[:path] = "s3://relationalai-documentation-public/csv-import/import-errors-emptylines.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktailRelationalAITypes.FilePos(15)"sazerac"
:cocktailRelationalAITypes.FilePos(30)"martini"
:cocktailRelationalAITypes.FilePos(45)"cosmopolitan"
:priceRelationalAITypes.FilePos(15)30
:priceRelationalAITypes.FilePos(30)24
:priceRelationalAITypes.FilePos(45)20

The result shows that the empty lines were not considered errors, and the entire file was successfully loaded into the relation csv.

Missing Values

Datasets might simply not contain one or more values per row. Here is an example data load where the price value is missing from the third and fifth rows:

def config[:data] = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,,"2020-04-04"
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64,2)"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktailRelationalAITypes.FilePos(29)"martini"
:cocktailRelationalAITypes.FilePos(60)"sazerac"
:cocktailRelationalAITypes.FilePos(86)"cosmopolitan"
:cocktailRelationalAITypes.FilePos(122)"bellini"
:dateRelationalAITypes.FilePos(29)2020-01-01
:dateRelationalAITypes.FilePos(60)2020-02-02
:dateRelationalAITypes.FilePos(86)2020-03-03
:dateRelationalAITypes.FilePos(122)2020-04-04
:priceRelationalAITypes.FilePos(29)12.50
:priceRelationalAITypes.FilePos(86)11.00
:quantityRelationalAITypes.FilePos(29)2
:quantityRelationalAITypes.FilePos(60)4
:quantityRelationalAITypes.FilePos(86)4
:quantityRelationalAITypes.FilePos(122)3

The fields with the missing values were skipped and not imported, but note that there are no :load_errors in the result.
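Because each column is keyed by file position, you can find the rows with a missing value by negation. A minimal sketch, assuming the csv relation from the example above:

def missing_price(pos) = csv(:cocktail, pos, _) and not csv(:price, pos, _)
def output(name) = csv(:cocktail, pos, name) and missing_price(pos) from pos

This returns the cocktail names of the rows whose price field was skipped.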

The next example loads the same dataset as before, but representing missing values with the string NA:

def config[:data] = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64,2)"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktailRelationalAITypes.FilePos(29)"martini"
:cocktailRelationalAITypes.FilePos(88)"cosmopolitan"
:dateRelationalAITypes.FilePos(29)2020-01-01
:dateRelationalAITypes.FilePos(88)2020-03-03
:load_errorsRelationalAITypes.FilePos(60)3""sazerac",4,NA,"2020-02-02""
:load_errorsRelationalAITypes.FilePos(124)3""bellini",3,NA,"2020-04-04""
:priceRelationalAITypes.FilePos(29)12.50
:priceRelationalAITypes.FilePos(88)11.00
:quantityRelationalAITypes.FilePos(29)2
:quantityRelationalAITypes.FilePos(88)4

This time, there are two load_errors, and the rows containing NA were not imported at all. This is because the configuration specified that column 3 (the one with the missing values) has type decimal(64,2), and the string “NA” cannot be parsed as a decimal(64,2), so the system treats those rows as errors. As in the previous examples, Rel followed a best-effort approach and imported whatever it could, namely the two well-formed rows.

The following code loads the same file, specifying instead that the third column has type “string”:

def config[:data] = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "string"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktailRelationalAITypes.FilePos(29)"martini"
:cocktailRelationalAITypes.FilePos(60)"sazerac"
:cocktailRelationalAITypes.FilePos(88)"cosmopolitan"
:cocktailRelationalAITypes.FilePos(124)"bellini"
:dateRelationalAITypes.FilePos(29)2020-01-01
:dateRelationalAITypes.FilePos(60)2020-02-02
:dateRelationalAITypes.FilePos(88)2020-03-03
:dateRelationalAITypes.FilePos(124)2020-04-04
:priceRelationalAITypes.FilePos(29)"12.50"
:priceRelationalAITypes.FilePos(60)"NA"
:priceRelationalAITypes.FilePos(88)"11.00"
:priceRelationalAITypes.FilePos(124)"NA"
:quantityRelationalAITypes.FilePos(29)2
:quantityRelationalAITypes.FilePos(60)4
:quantityRelationalAITypes.FilePos(88)4
:quantityRelationalAITypes.FilePos(124)3

There are no errors now.

Such files can thus be loaded using “string” as the type of the column with missing values; after the load, the string values can be cleaned, filtered, and converted to their intended type using Rel.
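For instance, the NA markers can be filtered out and the remaining strings converted after the load. A sketch assuming the csv relation from the example above and the standard library's parse_decimal:

def clean_price(pos, p) = csv(:price, pos, s) and s != "NA" and p = parse_decimal[64, 2, s] from s

Rows whose price was "NA" simply have no entry in clean_price.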

Error in Delimiter

In the following example, the data itself is correct but uses a different delimiter (|) than the one erroneously specified in the configuration (;):

def config[:data] = """
cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def config[:syntax, :delim] = ';'

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail|priceRelationalAITypes.FilePos(15)"sazerac"
:cocktail|priceRelationalAITypes.FilePos(28)"martini"
:cocktail|priceRelationalAITypes.FilePos(41)"cosmopolitan"

Although you may have expected an error, the system behaved as specified. Since it did not find the delimiter ; anywhere in the file, it concluded that the file contains a single column, and read it accordingly.
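To import this data correctly, specify the delimiter that the file actually uses:

def config[:data] = """
cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def config[:syntax, :delim] = '|'

def csv = load_csv[config]

With :delim set to '|', the two columns are parsed as expected.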

Tips

Reusing Configurations

If you need to perform CSV imports that have a common configuration but some slight variations, you can reuse a partial configuration by extending or overriding it.

The following example uses config_template to define config1 and config2, changing the type of column b with the help of the right_override stdlib utility:

module config_template
def data = """
a,b
2,4
3,9
"""

def schema:a = "int"
def schema:b = "int"
end

def config1 = config_template ++> (:schema, :b, "float")
def config2 = config_template ++> (:schema, :b, "string")

def output:one = load_csv[config1]
def output:two = load_csv[config2]

Relation: output

:one:aRelationalAITypes.FilePos(4)2
:one:aRelationalAITypes.FilePos(8)3
:one:bRelationalAITypes.FilePos(4)4.0
:one:bRelationalAITypes.FilePos(8)9.0
:two:aRelationalAITypes.FilePos(4)2
:two:aRelationalAITypes.FilePos(8)3
:two:bRelationalAITypes.FilePos(4)"4"
:two:bRelationalAITypes.FilePos(8)"9"

Summary

You have learned how to import CSV data using Rel, in particular using the load_csv functionality. By following the examples, you are now able to import various forms of CSV data into your RKGMS database, including string constants, local CSV files, and files in cloud storage. You have learned about data import errors, and are able to follow the steps if you run into any loading errors. For related reading, see the CSV Export guide.