CSV Import

This how-to guide demonstrates how to import data from a CSV file using Rel.


Goal

After reading this guide you should know how to import various forms of CSV data into the Rel database, including string constants, local CSV files, and files in cloud storage.

Preliminaries

In the following examples, we will use several small files to showcase the different options that Rel offers when importing data. For each example, we provide a link for downloading the file, so you can run the examples locally.

You may also find the CSV Export how-to guide useful.

Simple Data Importing

Let’s begin by importing a very simple CSV file:

mydata
10
20
30
40

file: simple-import-1col.csv

We can import this data using load_csv, providing the location where the data can be found.

query
def csv = load_csv[
"s3://relationalai-documentation-public/csv-import/simple-import-1col.csv"
]

def output = csv

Relation: output

:mydata, RelationalAITypes.FilePos(7), "10"
:mydata, RelationalAITypes.FilePos(10), "20"
:mydata, RelationalAITypes.FilePos(13), "30"
:mydata, RelationalAITypes.FilePos(16), "40"

After running the code above, we see that the relation csv holds the imported data from the specified file.

Even though the imported file had a single data column, the relation csv contains triples of the form (col, pos, value), where:

  1. col refers to the CSV column name,
  2. pos is the record identifier that uniquely identifies the row (aka file position) the data is coming from. Using the file position instead of the actual row number allows for parallelized loading of CSV data. (See this section if you need row numbers in the final result.)
  3. value is the data from our imported CSV file.

The Rel code above is equivalent to the following, which spells out the variables explicitly (“pointwise”, instead of “pointfree”) and will load the same data into the csv relation:

def path = "s3://relationalai-documentation-public/csv-import/simple-import-1col.csv"
def csv(col, pos, v) = load_csv[path](col, pos, v)

In this very first example, our data had a single column: only one value per line. Additionally, as you may have observed, the data was imported as strings, although in this case we would likely prefer integers.

Rel, of course, lets us import data with more than one column, and define the column types. This is done by specifying import options, which we discuss next.

Import Options

To import data that requires specifying the CSV syntax and data schema, we need to define a configuration relation, commonly called config, which we pass as a parameter to load_csv:

def csv = load_csv[config]

We do this instead of just passing a file location as we did in the simple example above.

All configurations are controlled via this config relation, which has the following structure (expressed as a module):

module config
def path = "..." // we must specify path or data
def data = "..."

module syntax
//...
end

module schema
//...
end

module integration
//...
end
end

The two main configuration options are:

path: A string specifying the location and name of the file to import. It can point to S3 or Azure URLs, provided they are readable by the server.
data: A string constant or string relation specifying the actual data to be imported. If path is also present, data is ignored and only path is used.

In the following sections, we discuss in detail the options within the syntax, schema, and integration modules.

Syntax Options

As shown in the example above, the config:syntax relation specifies formatting options. For example, def config:syntax:delim = '|' changes the delimiter used to split the fields when importing.

The syntax field can use the following sub-fields as options:

syntax:header: A relation with (Int, RelName) pairs, specifying the symbols to be used as column names, where the Int indexes indicate the column order. With this option, we can override the default (or existing) header names.
syntax:header_row: An Int that specifies the input row containing the header names, if they are not defined in the config through the header parameter above. Values smaller than 1 indicate that no header is present; in that case, Rel creates column names for us (see Defining Column Names). If no header_row value is specified, a default of 1 is used (i.e., the first line of the file).
syntax:delim: The Char delimiter used in the file to separate the columns. The default delimiter is a comma (,).
syntax:quotechar: A Char that indicates a quoted text field. A quoted field is required in order to include special characters, like the delimiter or a newline, in a text value. The default quote character is ".
syntax:escapechar: The Char used to escape the quote character within a quoted field. The default value is a backslash (\).

Schema Options

schema: A binary relation indicating the file schema. The pairs are of the form (RelName, String), where the column name is given as a RelName and the data type as a String (e.g., (:a, "int")).

The default behavior is that everything is loaded as a string.

The supported data types are "int", "string", "float", "decimal(n, digits)" (where n is the number of bits and digits is the number of decimal digits of precision), "date", and "datetime". Note that the type definitions are case sensitive.

Example:

def config:schema = {(:a, "int"); (:b, "string")}

Using module syntax, this can also be written as:

module config
module schema
def a = "int"
def b = "string"
end
end

Integration Options

integration: Allows importing from the cloud. In this case, we also have to specify the integration:provider (e.g., "azure") as well as the integration:credentials used to access the cloud (e.g., a sas_token).

Importing Local CSV Files

To import a local CSV file using the RAI notebook:

  1. Upload the file as a string to a named relation, say, mydata, using the RAI notebook upload functionality.

  2. Use def config:data = mydata, and set other configuration options as needed.

  3. Do load_csv[config].
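
The steps above can be sketched as follows. This is a sketch: the schema entries are illustrative, and mydata is assumed to be the relation created by the notebook upload in step 1.

```rel
// mydata is assumed to hold the uploaded CSV file contents as a string
def config:data = mydata

// illustrative schema; adjust to your file's columns
def config:schema:cocktail = "string"
def config:schema:price = "int"

def csv = load_csv[config]
```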

Importing Data from a String

The actual data can be provided as string. This is handy for small data sets and tests. This functionality is also used for importing local CSV files via the RAI Notebook.

To specify the data via a string, we need to set the data option:

query
def config:data = """
a|b|c
1|2|3
4|5|6
"""
def config:syntax:delim = '|'

def output = load_csv[config]

Relation: output

:a, RelationalAITypes.FilePos(7), "1"
:a, RelationalAITypes.FilePos(13), "4"
:b, RelationalAITypes.FilePos(7), "2"
:b, RelationalAITypes.FilePos(13), "5"
:c, RelationalAITypes.FilePos(7), "3"
:c, RelationalAITypes.FilePos(13), "6"

Specifying Column Types (File Schema)

In the previous example, we loaded the data as strings because we didn’t indicate a specific data type. Now, let’s instruct Rel to load the data as numbers.

In addition to the path option, we also specify the schema of the imported column mydata to be of type float:

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-1col.csv"
def config[:schema, :mydata] = "float"

def csv = load_csv[config]
def output = csv

Relation: output

:mydata, RelationalAITypes.FilePos(7), 10.0
:mydata, RelationalAITypes.FilePos(10), 20.0
:mydata, RelationalAITypes.FilePos(13), 30.0
:mydata, RelationalAITypes.FilePos(16), 40.0

As we can observe now, the file data has been read as floating point numbers as specified with schema.

Importing More Than One Column

Let’s now turn to importing a file with more than one column. For example, consider the following data file with four columns, each with a different data type: string, integer, float, and date, respectively.

cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"

file: simple-import-4cols.csv

Using the schema option, we can import this data in a similar way as in the previous example, simply by specifying the data type of each of the imported columns:

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64, 2)"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail, RelationalAITypes.FilePos(29), "martini"
:cocktail, RelationalAITypes.FilePos(60), "sazerac"
:cocktail, RelationalAITypes.FilePos(91), "cosmopolitan"
:cocktail, RelationalAITypes.FilePos(127), "bellini"
:date, RelationalAITypes.FilePos(29), 2020-01-01
:date, RelationalAITypes.FilePos(60), 2020-02-02
:date, RelationalAITypes.FilePos(91), 2020-03-03
:date, RelationalAITypes.FilePos(127), 2020-04-04
:price, RelationalAITypes.FilePos(29), 12.50
:price, RelationalAITypes.FilePos(60), 14.25
:price, RelationalAITypes.FilePos(91), 11.00
:price, RelationalAITypes.FilePos(127), 12.25
:quantity, RelationalAITypes.FilePos(29), 2
:quantity, RelationalAITypes.FilePos(60), 4
:quantity, RelationalAITypes.FilePos(91), 4
:quantity, RelationalAITypes.FilePos(127), 3

This confirms that the data in each column has the data type that we specified above.

Defining Column Names

In the previous examples, we imported the file using the default column names, as defined in the first row of the file (this row number can be specified with the header_row parameter). Rel also lets us specify the column names we would like instead. Once we decide to specify the header ourselves, we must specify all the column names (i.e., we can’t skip some of them).

In our next example, we load the same data but change the imported column names: we rename cocktail to beverage, quantity to amount, and price to value. We leave date as is. We do this using the header parameter within the config:syntax relation. The header parameter takes tuples of the form (col_pos, col_name) that indicate how we would like to name each of the columns, where col_pos refers to the column position as it appears in the imported file.

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"

def config[:syntax, :header] = (1, :beverage); (2, :amount); (3, :value); (4, :date)

def config[:schema, :beverage] = "string"
def config[:schema, :amount] = "int"
def config[:schema, :value] = "decimal(64,2)"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def output = csv

Relation: output

:amount, RelationalAITypes.FilePos(29), 2
:amount, RelationalAITypes.FilePos(60), 4
:amount, RelationalAITypes.FilePos(91), 4
:amount, RelationalAITypes.FilePos(127), 3
:beverage, RelationalAITypes.FilePos(29), "martini"
:beverage, RelationalAITypes.FilePos(60), "sazerac"
:beverage, RelationalAITypes.FilePos(91), "cosmopolitan"
:beverage, RelationalAITypes.FilePos(127), "bellini"
:date, RelationalAITypes.FilePos(29), 2020-01-01
:date, RelationalAITypes.FilePos(60), 2020-02-02
:date, RelationalAITypes.FilePos(91), 2020-03-03
:date, RelationalAITypes.FilePos(127), 2020-04-04
:value, RelationalAITypes.FilePos(29), 12.50
:value, RelationalAITypes.FilePos(60), 14.25
:value, RelationalAITypes.FilePos(91), 11.00
:value, RelationalAITypes.FilePos(127), 12.25

We can indicate that there is no header in the file by setting the parameter syntax:header_row to 0 or any negative number.

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"

def config:syntax:header_row = -1

def csv = load_csv[config]
def output = csv

Relation: output

:COL1, RelationalAITypes.FilePos(0), "cocktail"
:COL1, RelationalAITypes.FilePos(29), "martini"
:COL1, RelationalAITypes.FilePos(60), "sazerac"
:COL1, RelationalAITypes.FilePos(91), "cosmopolitan"
:COL1, RelationalAITypes.FilePos(127), "bellini"
:COL2, RelationalAITypes.FilePos(0), "quantity"
:COL2, RelationalAITypes.FilePos(29), "2"
:COL2, RelationalAITypes.FilePos(60), "4"
:COL2, RelationalAITypes.FilePos(91), "4"
:COL2, RelationalAITypes.FilePos(127), "3"
:COL3, RelationalAITypes.FilePos(0), "price"
:COL3, RelationalAITypes.FilePos(29), "12.50"
:COL3, RelationalAITypes.FilePos(60), "14.25"
:COL3, RelationalAITypes.FilePos(91), "11.00"
:COL3, RelationalAITypes.FilePos(127), "12.25"
:COL4, RelationalAITypes.FilePos(0), "date"
:COL4, RelationalAITypes.FilePos(29), "2020-01-01"
:COL4, RelationalAITypes.FilePos(60), "2020-02-02"
:COL4, RelationalAITypes.FilePos(91), "2020-03-03"
:COL4, RelationalAITypes.FilePos(127), "2020-04-04"

In this case, the system assigned default COLX names to the columns. Additionally, it considered the first row (which actually contains the column headers) as actual data and imported that data as well.

Since we didn’t specify the types, all data were read in as strings. We can specify the schema using the COLX attributes. For example, for the following file:

"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"

file: simple-import-4cols-noheader.csv

we can specify the types of the columns as follows:

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-4cols-noheader.csv"

def config[:syntax, :header_row] = -1

def config[:schema, :COL1] = "string"
def config[:schema, :COL2] = "int"
def config[:schema, :COL3] = "decimal(64,2)"
def config[:schema, :COL4] = "date"

def csv = load_csv[config]
def output = csv

Relation: output

:COL1, RelationalAITypes.FilePos(0), "martini"
:COL1, RelationalAITypes.FilePos(31), "sazerac"
:COL1, RelationalAITypes.FilePos(62), "cosmopolitan"
:COL1, RelationalAITypes.FilePos(98), "bellini"
:COL2, RelationalAITypes.FilePos(0), 2
:COL2, RelationalAITypes.FilePos(31), 4
:COL2, RelationalAITypes.FilePos(62), 4
:COL2, RelationalAITypes.FilePos(98), 3
:COL3, RelationalAITypes.FilePos(0), 12.50
:COL3, RelationalAITypes.FilePos(31), 14.25
:COL3, RelationalAITypes.FilePos(62), 11.00
:COL3, RelationalAITypes.FilePos(98), 12.25
:COL4, RelationalAITypes.FilePos(0), 2020-01-01
:COL4, RelationalAITypes.FilePos(31), 2020-02-02
:COL4, RelationalAITypes.FilePos(62), 2020-03-03
:COL4, RelationalAITypes.FilePos(98), 2020-04-04

Here we can see that the data is now read in with the correct types, i.e., string, int, decimal, and date.

Going back to the data with the headers, we can also use the second row to define the column names:

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"

def config[:syntax, :header_row] = 2

def csv = load_csv[config]
def output = csv

Relation: output

:12.50, RelationalAITypes.FilePos(60), "14.25"
:12.50, RelationalAITypes.FilePos(91), "11.00"
:12.50, RelationalAITypes.FilePos(127), "12.25"
:2, RelationalAITypes.FilePos(60), "4"
:2, RelationalAITypes.FilePos(91), "4"
:2, RelationalAITypes.FilePos(127), "3"
:2020-01-01, RelationalAITypes.FilePos(60), "2020-02-02"
:2020-01-01, RelationalAITypes.FilePos(91), "2020-03-03"
:2020-01-01, RelationalAITypes.FilePos(127), "2020-04-04"
:martini, RelationalAITypes.FilePos(60), "sazerac"
:martini, RelationalAITypes.FilePos(91), "cosmopolitan"
:martini, RelationalAITypes.FilePos(127), "bellini"

Specifying header_row=2 tells the system where to start reading from within the file. The first row of cocktail,quantity,price,date is ignored. We start at line 2, consider that line as a header, and then load the rest of the data. In general, all rows above the header_row number are ignored.

This option is useful when a number of lines at the beginning of a file need to be skipped. For example, a file may start with a few lines (e.g., beginning with #) that describe what the file is about, before the actual header and data. With header_row, we can load the header and data without having to edit the file to remove such comments.
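
As a sketch, the following load skips two hypothetical comment lines by pointing header_row at the line that actually holds the header (the inline data here is illustrative):

```rel
def config:data = """
# cocktail sales export
# generated before the header line
cocktail,price
"sazerac",30
"martini",24
"""

// the header is on the third line of the input
def config:syntax:header_row = 3

def output = load_csv[config]
```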

Using Different Delimiters

We can change the delimiters used to read the data. Consider the following file, which uses | as a delimiter:

cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20

file: simple-import-1col-pipedelim.csv

We can read such data by specifying the delim parameter in the config relation:

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-1col-pipedelim.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def config[:syntax, :delim] = '|'

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail, RelationalAITypes.FilePos(15), "sazerac"
:cocktail, RelationalAITypes.FilePos(28), "martini"
:cocktail, RelationalAITypes.FilePos(41), "cosmopolitan"
:price, RelationalAITypes.FilePos(15), 30
:price, RelationalAITypes.FilePos(28), 24
:price, RelationalAITypes.FilePos(41), 20

Loading TSV Files

For the common case of loading a tab-separated file (TSV), set the delimiter to \t:

def config:syntax:delim = '\t'

The other config options will work in the same way as discussed in this document for CSVs.
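
A complete TSV load can thus be sketched as follows. This is a sketch: the file location and schema entries are illustrative, not real resources.

```rel
// hypothetical tab-separated file
def config:path = "s3://mybucket/cocktails.tsv"

// tab as the field delimiter
def config:syntax:delim = '\t'

// illustrative schema
def config:schema:cocktail = "string"
def config:schema:price = "int"

def csv = load_csv[config]
```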

Loading the Relation Key from File

In the previous examples, the key of each row was a file offset (FilePos) within the file. We now show how to use a column of the file itself as the key of the relation. For example, consider the following data, where we wish the first column, id, to be the key:

id,cocktail
11,"martini"
22,"sazerac"
33,"cosmopolitan"
44,"bellini"

file: simple-import-2cols-with-key.csv

To use the id column as the key, we can proceed as follows: First, we read in the data in a similar way to the examples that we have seen so far. Then, we convert the csv relation to another relation, cocktails, which replaces the FilePos row key with the values in the id column:

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-2cols-with-key.csv"
def config[:schema, :id] = "int"
def config[:schema, :cocktail] = "string"

def csv = load_csv[config]

def cocktails(id, :cocktail, cocktail) =
csv(:id, pos, id) and csv(:cocktail, pos, cocktail)
from pos

Relation: cocktails

11, :cocktail, "martini"
22, :cocktail, "sazerac"
33, :cocktail, "cosmopolitan"
44, :cocktail, "bellini"

Please note that the key should be unique for each row. It is good practice to enforce this uniqueness with an integrity constraint.
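
Such a check can be sketched as an integrity constraint over the loaded csv relation (the constraint name is illustrative): it requires that no id value appears at two different file positions.

```rel
// violated if the same id occurs on two different rows of the file
ic cocktail_id_is_unique(pos1, pos2, id) {
    csv(:id, pos1, id) and csv(:id, pos2, id)
    implies
    pos1 = pos2
}
```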

Replacing FilePos with Row Numbers

In the examples so far, we have used either the file position as the key for each row or, as in the previous example, a key loaded directly from the input file. Rel provides yet another option: using the row number as the key. To do this, we pass the load_csv output relation to the lined_csv utility, as follows:

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/simple-import-4cols.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64, 2)"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def rows_csv = lined_csv[csv]

Relation: rows_csv

:cocktail, 1, "martini"
:cocktail, 2, "sazerac"
:cocktail, 3, "cosmopolitan"
:cocktail, 4, "bellini"
:date, 1, 2020-01-01
:date, 2, 2020-02-02
:date, 3, 2020-03-03
:date, 4, 2020-04-04
:price, 1, 12.50
:price, 2, 14.25
:price, 3, 11.00
:price, 4, 12.25
:quantity, 1, 2
:quantity, 2, 4
:quantity, 3, 4
:quantity, 4, 3

We can see that we now have successive row ids as keys instead of the file positions. Note that lined_csv only counts lines that contain data, including errors, while headers and empty lines are not counted.

Loading Multiple Files in the Same Relation

In certain cases, we wish to load data from several different files into the same relation. For example, we could have two files that correspond to the sales over the period of two different months, or we could have a dataset that was simply split in several files. Here are two such example files showing cocktails and units sold in the months of March and April:

cocktail,units
martini,12
sazerac,23

file: cocktail-units-march.csv

cocktail,units
martini,12
bellini,32

file: cocktail-units-april.csv

In order to load the data in this case, we use load_csv twice, once for each file. Additionally, we add an extra leading element to each tuple to record which file the data comes from:

query
def config_march[:path] = "s3://relationalai-documentation-public/csv-import/cocktail-units-march.csv"
def config_march[:schema, :cocktail] = "string"
def config_march[:schema, :units] = "int"

def config_april[:path] = "s3://relationalai-documentation-public/csv-import/cocktail-units-april.csv"
def config_april[:schema, :cocktail] = "string"
def config_april[:schema, :units] = "int"

def csv = "march", load_csv[config_march]
def csv = "april", load_csv[config_april]

def output = csv

Relation: output

"april", :cocktail, RelationalAITypes.FilePos(15), "martini"
"april", :cocktail, RelationalAITypes.FilePos(26), "bellini"
"april", :units, RelationalAITypes.FilePos(15), 12
"april", :units, RelationalAITypes.FilePos(26), 32
"march", :cocktail, RelationalAITypes.FilePos(15), "martini"
"march", :cocktail, RelationalAITypes.FilePos(26), "sazerac"
"march", :units, RelationalAITypes.FilePos(15), 12
"march", :units, RelationalAITypes.FilePos(26), 23

Examining the output, we observe an additional string, either "march" or "april", attached to each tuple. When importing data from several files, it is useful to keep in mind that, because the file positions serve as keys, loading the files without such a distinguishing element may lose data.

To make this more specific, consider the following code, which again loads the two files into the same csv relation but, this time, does not add the month as an additional element:

query
def config_march[:path] = "s3://relationalai-documentation-public/csv-import/cocktail-units-march.csv"
def config_march[:schema, :cocktail] = "string"
def config_march[:schema, :units] = "int"

def config_april[:path] = "s3://relationalai-documentation-public/csv-import/cocktail-units-april.csv"
def config_april[:schema, :cocktail] = "string"
def config_april[:schema, :units] = "int"

def csv = load_csv[config_march]
def csv = load_csv[config_april]

def output = csv

Relation: output

:cocktail, RelationalAITypes.FilePos(15), "martini"
:cocktail, RelationalAITypes.FilePos(26), "bellini"
:cocktail, RelationalAITypes.FilePos(26), "sazerac"
:units, RelationalAITypes.FilePos(15), 12
:units, RelationalAITypes.FilePos(26), 23
:units, RelationalAITypes.FilePos(26), 32

In this case, we obtain the tuple martini,12 only once in the loaded data. This is for two reasons. First, the tuple is the same in both files. In addition, it also happens that the tuples appear in the exact same file position (specifically, FilePos(15)). As a result, tuples like (:cocktail, FilePos(15), "martini") will get generated by load_csv for both files. Because we are performing a union over both loaded datasets, duplicate tuples will be dropped and tuples like (:cocktail, FilePos(15), "martini") will appear only once in csv.

Loading from the Cloud

Rel supports importing data from the cloud. Here is an example that shows how to set the parameters for importing from Azure Blob Storage, using the integration option within the config relation. Please note that you will have to provide a valid sas_token and azure_url for the code to work; you can retrieve these through your Azure account and supply them in the code below:

def config[:path] = "azure://myaccount.blob.core.windows.net/sascontainer/myfile.txt"
def config[:integration, :provider] = "azure"
def config[:integration, :credentials, :azure_sas_token] =
"sv=2014-02-14&sr=b&si=TestPolicy&sig=o%2B5%2F0C%2BLm7tWWftNKvQEGKHlSt%2Bfs8No7FZkUk5T%2Bv0%3D"

def csv = load_csv[config]

Finding and Handling Data Importing Errors

Sometimes, there are import errors that we have not anticipated or, simply, cannot be handled. In either case, load_csv supports error handling to help us understand what may have gone wrong.

When the CSV input file cannot be reliably read (e.g., it is formatted incorrectly), Rel attempts to parse as many rows as possible. Rows that it could not reliably parse are stored in the :load_errors sub-relation, which can be accessed the same way as the rest of the sub-relations in the imported relation (e.g., csv in the previous examples). The :load_errors relation has the format (:load_errors, file_position, error_column, raw_line).

Error in Formatting

Here is an example file that contains two formatting errors, namely an extra field in each of the first two data lines:

cocktail,price
"sazerac",30,"extra"
"martini",24,"another_extra"
"cosmopolitan",20

file: import-errors-extracol.csv

In this example file, the first line contains the header, which indicates that we expect two columns. The second line, however, contains one extra field, and so does the third line. Finally, the last line is correct.

We will attempt to import this data by considering that the first column should be a string, while the second column should be an integer. We can do this using the following code:

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/import-errors-extracol.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail, RelationalAITypes.FilePos(65), "cosmopolitan"
:load_errors, RelationalAITypes.FilePos(15), 0, "sazerac,30,extra"
:load_errors, RelationalAITypes.FilePos(36), 0, "martini,24,another_extra"
:price, RelationalAITypes.FilePos(65), 20

This output specifies that there were two load errors and displays the erroneous lines in the last field. The number between the file position and the line indicates the 1-based index of the field where Rel believes the error to be, or 0 if the whole line could not be parsed for some reason (as is the case here).

Ensuring no Errors

Before proceeding to additional error examples and how to handle them, we should note that one easy way to verify that a load completed without errors is through an integrity constraint.

Such a constraint could just count the number of :load_errors in the resulting relation (csv in our examples).

If the number of occurrences is larger than zero, then the loading did not complete successfully and the constraint would be violated.

Here is the same example as before with the addition of an integrity constraint, no_load_errors, that is violated when we attempt to load problematic data:

def config[:path] = "s3://relationalai-documentation-public/csv-import/import-errors-extracol.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def csv = load_csv[config]

ic no_load_errors(x...) {
    not csv(:load_errors, x...)
}

We can add similar constraints to check various other data validation requirements.

Error in the Data Type

Here is another example, where we specify the wrong type for the data that we have. Specifically, although there is no real error in the data, we specify the wrong data type for each column (i.e., int for the cocktail column and string for the price). Here is the data:

cocktail,price
"sazerac",30
"martini",24
"cosmopolitan",20

file: import-errors-wrongtype.csv

Next, we try to load the data as follows (please note the wrong data type definitions):

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/import-errors-wrongtype.csv"
def config[:schema, :cocktail] = "int"
def config[:schema, :price] = "string"

def csv = load_csv[config]
def output = csv

Relation: output

:load_errors, RelationalAITypes.FilePos(15), 1, "sazerac,30"
:load_errors, RelationalAITypes.FilePos(28), 1, "martini,24"
:load_errors, RelationalAITypes.FilePos(41), 1, "cosmopolitan,20"
:price, RelationalAITypes.FilePos(15), "30"
:price, RelationalAITypes.FilePos(28), "24"
:price, RelationalAITypes.FilePos(41), "20"

The system is telling us that it found 3 errors in the data and specifies the lines where they occurred. It correctly reports that it could not convert the cocktail names to int and therefore could not fully parse those lines.

At the same time, Rel is following a best-effort approach to parse as much data as possible. Although it was not able to convert strings to int, it could, of course, read in the integers of the price column as string. And, as a result, it imported that data, which is the last three lines of the output that we see above.

In summary, keep in mind that, even if a line of the file may contain an error, Rel may still parse and import part of that line.

Empty Lines

Another example of this behavior involves empty lines within a file: Rel ignores them completely and continues parsing at the next line that contains data. For example, for the following data:

cocktail,price
"sazerac",30


"martini",24


"cosmopolitan",20


file: import-errors-emptylines.csv

we can read the data using the following code:

query
def config[:path] = "s3://relationalai-documentation-public/csv-import/import-errors-emptylines.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail, RelationalAITypes.FilePos(15), "sazerac"
:cocktail, RelationalAITypes.FilePos(30), "martini"
:cocktail, RelationalAITypes.FilePos(45), "cosmopolitan"
:price, RelationalAITypes.FilePos(15), 30
:price, RelationalAITypes.FilePos(30), 24
:price, RelationalAITypes.FilePos(45), 20

The result shows that the empty lines were not considered errors, and the entire file was successfully loaded into the relation csv.

Missing Values

Datasets might simply not contain one or more values per row. Here is an example data load where the price value is missing from the third and fifth rows:

query
def config[:data] = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,,"2020-04-04"
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64,2)"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail, RelationalAITypes.FilePos(30), "martini"
:cocktail, RelationalAITypes.FilePos(61), "sazerac"
:cocktail, RelationalAITypes.FilePos(87), "cosmopolitan"
:cocktail, RelationalAITypes.FilePos(123), "bellini"
:date, RelationalAITypes.FilePos(30), 2020-01-01
:date, RelationalAITypes.FilePos(61), 2020-02-02
:date, RelationalAITypes.FilePos(87), 2020-03-03
:date, RelationalAITypes.FilePos(123), 2020-04-04
:price, RelationalAITypes.FilePos(30), 12.50
:price, RelationalAITypes.FilePos(87), 11.00
:quantity, RelationalAITypes.FilePos(30), 2
:quantity, RelationalAITypes.FilePos(61), 4
:quantity, RelationalAITypes.FilePos(87), 4
:quantity, RelationalAITypes.FilePos(123), 3

The fields with the missing values were skipped and not imported, but note that there are no :load_errors in the result.

For our next example, we load the same dataset as before, except we represent missing values with the string NA:

query
def config[:data] = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64,2)"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail, RelationalAITypes.FilePos(30), "martini"
:cocktail, RelationalAITypes.FilePos(61), "sazerac"
:cocktail, RelationalAITypes.FilePos(89), "cosmopolitan"
:cocktail, RelationalAITypes.FilePos(125), "bellini"
:date, RelationalAITypes.FilePos(30), 2020-01-01
:date, RelationalAITypes.FilePos(61), 2020-02-02
:date, RelationalAITypes.FilePos(89), 2020-03-03
:date, RelationalAITypes.FilePos(125), 2020-04-04
:load_errors, RelationalAITypes.FilePos(61), 3, "sazerac,4,NA,2020-02-02"
:load_errors, RelationalAITypes.FilePos(125), 3, "bellini,3,NA,2020-04-04"
:price, RelationalAITypes.FilePos(30), 12.50
:price, RelationalAITypes.FilePos(89), 11.00
:quantity, RelationalAITypes.FilePos(30), 2
:quantity, RelationalAITypes.FilePos(61), 4
:quantity, RelationalAITypes.FilePos(89), 4
:quantity, RelationalAITypes.FilePos(125), 3

We get the same result as before, plus two load_errors. This is because we specified that column 3 (whose value is missing from two rows) should be a decimal(64,2). Since the "NA" strings cannot be parsed as decimal(64,2), the system considers these rows erroneous. As in the previous examples, Rel followed a best-effort approach and imported whatever was possible.

Let’s now load the same file, specifying instead that the third column has type “string”:

query
def config[:data] = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "string"
def config[:schema, :date] = "date"

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail, RelationalAITypes.FilePos(30), "martini"
:cocktail, RelationalAITypes.FilePos(61), "sazerac"
:cocktail, RelationalAITypes.FilePos(89), "cosmopolitan"
:cocktail, RelationalAITypes.FilePos(125), "bellini"
:date, RelationalAITypes.FilePos(30), 2020-01-01
:date, RelationalAITypes.FilePos(61), 2020-02-02
:date, RelationalAITypes.FilePos(89), 2020-03-03
:date, RelationalAITypes.FilePos(125), 2020-04-04
:price, RelationalAITypes.FilePos(30), "12.50"
:price, RelationalAITypes.FilePos(61), "NA"
:price, RelationalAITypes.FilePos(89), "11.00"
:price, RelationalAITypes.FilePos(125), "NA"
:quantity, RelationalAITypes.FilePos(30), 2
:quantity, RelationalAITypes.FilePos(61), 4
:quantity, RelationalAITypes.FilePos(89), 4
:quantity, RelationalAITypes.FilePos(125), 3

There are no errors now.

Such files can thus be loaded using "string" as the type of the column with missing values; the string values can then be cleaned, filtered, and converted to their intended type using Rel after the load.
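
For instance, the NA placeholders can be filtered out and the remaining price strings converted after the load. This is a sketch: it assumes the csv relation from the example above, the relation name clean_price is illustrative, and it uses parse_decimal for string-to-decimal conversion.

```rel
// keep only real prices, converting the strings to decimal(64, 2)
def clean_price(pos, p) =
    exists(s : csv(:price, pos, s) and s != "NA" and p = parse_decimal[64, 2, s])
```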

Error in Delimiter

In the following example, the data is correct but uses a different delimiter (|) than the one we erroneously specify in our code (;):

query
def config[:data] = """
cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"

def config[:syntax, :delim] = ';'

def csv = load_csv[config]
def output = csv

Relation: output

:cocktail|price, RelationalAITypes.FilePos(16), "sazerac|30"
:cocktail|price, RelationalAITypes.FilePos(29), "martini|24"
:cocktail|price, RelationalAITypes.FilePos(42), "cosmopolitan|20"

Although we may have expected an error, the system performed as designed. Since it did not find the specified delimiter anywhere, it considered the file to contain a single column and read it accordingly.

Reusing Configurations

If we need to perform a number of different CSV imports that share a common configuration, we can reuse the shared settings by defining a base module and extending it for each import. For example:

query
module baseconfig
def syntax:delim = '|'
def schema:a = "int"
def schema:b = "float"
end

def config = baseconfig
def config:schema:c = "string"
def config:data = """
a|b|c
1|2.3|foo
4|5.1|bar
"""

def output = load_csv[config]

Relation: output

:a, RelationalAITypes.FilePos(7), 1
:a, RelationalAITypes.FilePos(17), 4
:b, RelationalAITypes.FilePos(7), 2.3
:b, RelationalAITypes.FilePos(17), 5.1
:c, RelationalAITypes.FilePos(7), "foo"
:c, RelationalAITypes.FilePos(17), "bar"