CSV Import
This how-to guide demonstrates how to import CSV data using Rel.
Goal
By following this guide, you will be able to import CSV data into the RKGS database and modify the import configuration to handle a wide range of CSV formats and data locations.
Introduction
The Comma Separated Values (CSV) data format is one of the most common, and CSV data is ubiquitous. This how-to guide demonstrates how to import a wide range of CSV-like data with Rel, and describes all of the CSV import configuration options.
The load_csv Relation
The main way to load CSV data into the system is through the built-in Rel relation load_csv.
def config = ...
def my_csv = load_csv[config]
The relation config is used to define all CSV load configurations.
For details on all the available options, see Import Options below.
Getting Started
This guide begins with a simple example; for a more detailed one, see the Comprehensive Example below. Consider the following CSV file with two string columns, available in this Azure container:
country, drink
Spain,Gazpacho
Spain,Sangría
Argentina,"Yerba Mate"
United States,"Coca-Cola"
Italy,San Pellegrino
The following Rel code loads this file into a relation called my_data, and displays it using the table utility:
// read query
def my_data = load_csv["azure://raidocs.blob.core.windows.net/csv-import/simple-drinks.csv"]
def output = ::std::display::table[my_data]
The imported CSV data are stored in the my_data relation, which is a Rel module.
Each column in the original CSV is assigned to a subrelation, which is given the same name as the column header from the CSV.
File line numbers are used as the key to connect row entries.
For more details, see CSV Data Schema in the Working With CSV Data concept guide.
The table utility tells the rendering tool to display the imported data as a table.
By default, all columns are loaded as strings.
There is no automatic type conversion, so ...,123,... becomes the string "123".
Defining Import Options
To specify data types for each column, include them in a config relation, which is a module that defines the import configuration.
See the Modules concept guide to learn more about modules in Rel.
Note that config is an arbitrary module name used throughout this guide; you can give it any other name.
Consider this data:
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"
config:schema specifies data types for each column:
// write query
// Set the import configuration.
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end
// Import CSV data as a base relation.
def insert:my_data = load_csv[config]
// Display the imported data.
def output = ::std::display::table[my_data]
The code above stores the imported CSV data in the base relation my_data.
You will usually want to store imported data as base relations, and then define views as derived relations based on the base relation data. See Working With Base Relations for more details.
The statement def insert:my_data = load_csv[config] saves the imported data in the base relation my_data.
To add data to a new or existing base relation, you can use the control relation insert.
Note that the import is done within a write query, and the config definition is only present for the duration of the query.
For more details, see Working with Base Relations.
Querying CSV Data
Now you can query the imported CSV data by joining the columns of interest by their common key, i.e., the file position:
// read query
def output(c, d) =
my_data:cocktail(pos, c) and my_data:date(pos, d)
from pos
See CSV Data Schema for more details on how CSV data are stored in the RelationalAI RKGS.
Import Options
For very simple situations, where no configuration needs to be specified, just provide the file path:
def data = load_csv["azure://...."]
For more customized imports, load_csv takes a configuration relation as an argument.
module config
// ...
end
def data = load_csv[config]
The configuration relation is usually named config, but any other name can be used.
A good practice is to make the configuration definition transient using a read query, so it will not affect other data loads.
However, you can also install load configurations. A common use case for this is defining template configurations. See section Reusing Configurations for more details.
Within the load configuration, you can specify:
Option | Description |
---|---|
path | A URL for the CSV data. |
data | An alternative way to specify the data itself. |
integration | Credentials needed to access the data. |
schema | Schema of the CSV data. |
syntax | Syntax of the CSV data. |
These configurations must be specified in a named module and cannot be supplied directly in the argument of load_csv. In other words, patterns like the following will result in an error:
load_csv[{
(:path, "azure://.../data.csv");
(:syntax, :delim, '\t')
}]
Path
The import option path is a string that specifies the location and name of the file you want to import.
Currently, this string can point to azure://... (Microsoft Azure) or s3://... (Amazon S3) URLs.
See Accessing the Cloud for further details on the supported cloud storage providers, and Supported Data Files for the supported file types.
Example:
module config
def path = "azure://myaccount.blob.core.windows.net/container/file.csv"
end
Data
The import option data is a string specifying the actual data to be imported.
If path is also present, the data field is ignored and only path is used for importing the data.
Example:
module config
def data = "a,b,c\n1,2,3\n4,5,6"
end
You can also use the multi-line string functionality to write CSV data with several rows.
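For instance, here is a small sketch (with hypothetical column names a, b, and c) that supplies several rows of CSV data as a multi-line string:

```rel
// Sketch: CSV data given inline as a multi-line string.
module config
def data = """
a,b,c
1,2,3
4,5,6
"""
end
def my_data = load_csv[config]
```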
Integration
You may need to specify credentials to access your data in cloud storage, using the import option integration.
See the Accessing the Cloud guide for more details.
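As a rough sketch, an integration configuration for a private Azure container might look like the following; the provider and credential field names shown here are assumptions, and the exact names are documented in Accessing the Cloud:

```rel
module config
def path = "azure://myaccount.blob.core.windows.net/container/file.csv"
// Hypothetical credential fields -- see Accessing the Cloud for the exact names.
def integration:provider = "azure"
def integration:credentials:azure_sas_token = "<your-sas-token>"
end
```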
Schema
The import option schema is a binary relation indicating the file schema.
The pairs are of the form (RelName, String): the column name is specified as a Symbol (of type RelName) and the data type as a String (e.g., (:a, "int")).
The supported data types are "int", "string", "float", "decimal(n, digits)", "date", "datetime", and "bool".
For decimals, n is the number of bits and digits is the number of decimal digits, which should correspond to a valid Fixed Decimal Rel type. See the Date Rel Type and Datetime Rel Type docs for more about the supported formats.
For booleans, you can use one of the strings "true", "True", "1", or "T" to represent true, and one of "false", "False", "0", or "F" to represent false.
The default behavior is that everything is loaded as a string.
Note that the type definitions are case sensitive.
Example:
module config
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
(:happyhour, "bool");
}
end
Syntax
The import option syntax is a module with sub-fields, used to specify the CSV file format.
The module syntax has the following sub-fields as options:
Option | Description |
---|---|
syntax:header | A relation with (Int, RelName) pairs, specifying the symbols to be used as column names, where the Int indices indicate the column order. This option overrides the default (or existing) header names. |
syntax:header_row | An Int that specifies the row from where to start importing the file. The column names are specified by the content of that row, if they are not defined using the header option above. The values -1 and 0 indicate that no header is present and the system creates column names for you. See Defining a Header Row. If no header_row value is specified, the value 1 is used as default (that is, the first line of the file is used). |
syntax:datarow | An Int that specifies the row from where to begin parsing values. |
syntax:missingstrings | A String, or multiple instances of String, that should be interpreted as missing values. By default, only empty fields are considered missing. |
syntax:delim | The ASCII Char delimiter used in the file to separate the columns. The default delimiter is a comma , . |
syntax:quotechar | An ASCII Char that indicates a quoted text field. Quoting is required when a text field contains special characters such as the delimiter or a newline character. The default quote character is " . |
syntax:escapechar | An ASCII Char used to escape the quote character within a quoted text field. The default value is a backslash \ . |
syntax:decimalchar | An ASCII Char to be used when parsing float values with decimal parts. The default is . . |
syntax:groupmark | An ASCII Char denoting the number grouping mark. This is useful for parsing numbers with separators (for example, for thousands separators like 1,000.00, you can set groupmark=','). By default, group marks are not permitted in numbers. |
Example:
module config
module syntax
def delim = '|'
def header_row = -1
end
end
The schema and syntax options are applied to all types of data, regardless of their source (cloud, local, string literal, etc.).
CSV Data Model
In Rel, data is best stored in Graph Normal Form (GNF), a fully normalized representation that takes advantage of Rel features such as modularity and specialization.
CSV data imported via load_csv follow basic GNF conventions.
Concretely, the CSV data are represented as triples of the form (colname, filepos, value).
// read query
def output = load_csv["azure://raidocs.blob.core.windows.net/csv-import/simple-drinks.csv"]
For more details, see CSV Data Schema in the Working With CSV Data concept guide.
Data Sources
This section covers the available sources from which you can import CSV data.
Strings
You can directly specify CSV data as a string. This is handy for small datasets and tests.
Here’s an example:
// read query
def mydata = """
a|b|c
1|2|3
4|5|6
"""
module config
def data = mydata
def syntax:delim = '|'
end
def output = ::std::display::table[load_csv[config]]
Local Files
You can load local CSV data using the importing functionality from the RAI Console.
The uploaded file is stored as a string in a base relation, say, my_uploaded_file.
To process the file as CSV data, use the load_csv functionality with config:data pointing to that base relation:
def config:data(text) = my_uploaded_file(text) and String(text)
def my_data = load_csv[config]
The condition String(text) ensures that only string values get assigned to config:data.
See Strings above for details on how to specify CSV data directly as a string.
Using the schema option, you can specify the data type of each of the imported columns:
module config
def data = my_uploaded_file
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end
// Import CSV data as a base relation.
def insert:my_data = load_csv[config]
You can import files up to 64MB in size. If you try to import a larger file, you will receive a validation error. This limit applies to local files only; no limit is imposed for imports from cloud sources.
Cloud
You can import public and private CSV data from the supported cloud providers.
Here’s an example that loads public data:
module config
def path = "s3://my-s3-bucket/myfile.csv"
end
def my_csv = load_csv[config]
See Accessing the Cloud for further details.
Compressed Files
You can import compressed CSV files from the cloud. No specific options are required, so you can just provide the file path with the compression extension. For example:
module config
def path = "azure://myaccount.blob.core.windows.net/sascontainer/myfile.csv.gz"
end
def my_data = load_csv[config]
Currently only gzip compression, with files ending in .gz, is supported.
See Compressed Data for more details.
Schema Options
Column Data Types
Consider the following data file with four columns, each with a different data type: string, integer, decimal, and date.
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"
file: simple-import-4cols.csv
Using the schema option, you can import this file and specify the data type of each of the imported columns:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
This confirms that the data in each column has the data type that was specified above.
Note that quotes around individual values are stripped; the default escape sequence for a quote character is \".
Any spaces before or after a comma are preserved (as is standard with CSVs), with the exception of the header row.
Column Names
The previous examples imported the file using the default column names, as defined in the first row of the file (this row number can be specified with the header_row parameter).
Rel also lets you specify any column names you would like.
Once you decide to specify the header, you must specify all the column names; you cannot skip any of them.
The next example loads the same data, but changes the imported column names:
it renames cocktail to beverage, quantity to amount, and price to value. (The name date is not changed.)
This is done using the syntax and header parameters within the config relation.
The header parameter takes as input tuples of the form (col_pos, col_name), which indicate how to rename each of the columns.
Here, col_pos refers to the column position as it appears in the imported file.
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:header = {
(1, :beverage); (2, :amount);
(3, :value); (4, :date)
}
def schema = {
(:beverage, "string");
(:amount, "int");
(:value, "decimal(64,2)");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
No Header Information
To indicate that there is no header in the file, set the parameter syntax:header_row to 0 or -1:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:header_row = -1
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
In this case, the system assigned default COLX names to the columns.
Additionally, it considered the first row (which actually contains the column headers) as actual data and imported that data as well.
Since no types were specified, all data were read in as strings. You can specify the schema using the COLX attributes. For example, for the following file without a header:
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"
file: simple-import-4cols-noheader.csv
you can specify the types of the columns as follows:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols-noheader.csv"
def syntax:header_row = -1
def schema = {
(:COL1, "string");
(:COL2, "int");
(:COL3, "decimal(64,2)");
(:COL4, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
The data are now imported with their correct types: string, int, decimal, and date.
Syntax Options
Defining a Header Row
Going back to the data with the headers, you can also use the second row to define the column names.
Specifying header_row = 2 tells the system where to start reading within the file.
The first row, i.e., cocktail,quantity,price,date, is ignored.
The system starts at line 2, considers that line as a header, and then loads the rest of the data.
In general, all rows above the header_row are ignored.
This option is useful when a number of lines at the beginning of the file need to be ignored.
One example is if there are comments before the actual header and data that describe what the file is about.
By specifying the right header_row, you can load the header and subsequent data without having to edit the file to remove any comments at the top.
The example below clearly generates incorrect results, since row 2 is not, in fact, a header row:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:header_row = 2
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
In this case, the correct solution is to not define the header_row option within config, since the imported CSV file starts with the header row.
This is equivalent to def syntax:header_row = 1.
Modifying Delimiters
You can change the delimiters used to read the data. Consider the following file, which uses | as a delimiter:
cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20
file: simple-import-1col-pipedelim.csv
You can read such data by specifying the delim parameter in the config relation:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-1col-pipedelim.csv"
def schema = {
(:cocktail, "string");
(:price, "int");
}
def syntax:delim = '|'
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
Importing Numbers With Group Marks
In some cases, the data within a CSV file may contain numbers that have group marks, for example, 1,000,000.00.
You can import such numbers by specifying the schema of the CSV file, which lets the system know where to expect a number, together with the groupmark syntax option.
Here’s an example:
// read query
module config
def data = """
cocktail,sales
sazerac,1,000.00
daquiri,2,000.00
"""
def schema = {
(:cocktail, "string");
(:sales, "float");
}
def syntax:groupmark = ','
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
Note that removing the syntax:groupmark = ',' option causes the load to result in errors.
Omitting Top Rows
In some cases, the first few lines of a CSV file may contain comments or information that will not parse correctly.
You can omit loading a specific number of lines at the beginning of the file by using the syntax:datarow option as follows:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:datarow = 3
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
This omits the first two rows and loads the rest into the csv relation.
Note that since no header information was specified, the first row was used as a header by default.
To ensure that the first row doesn’t serve as a header, you also need to use the header_row = -1 option:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:header_row = -1
def syntax:datarow = 3
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
Here, the columns have generic names, such as COL1.
Loading TSV Files
For the common case of loading a tab-separated values (TSV) file, set the delimiter to \t:
module config
def syntax:delim = '\t'
end
The other config options apply to TSV files in the same way as to CSV files.
Comprehensive Example
Here’s a more comprehensive import example, which you can copy and paste. It touches on multiple CSV loading aspects users will encounter when working with CSV data.
// write query
def mydata = """
first,last,age,plays_for
Leonel,Messi,32,PSG
Cristiano,Ronaldo,36,"Manchester United"
"""
// Set the import configuration.
module config
def data = mydata
module schema
def first = "string"
def last = "string"
def age = "int"
def plays_for = "string"
end
module syntax
def header_row = 1
def delim = ','
// To change the header names:
// `def header = {(1, :firstname); (2, :lastname); (3, :age); (4, :team)}`.
end
end
// Define the intermediate derived relation.
def csv = load_csv[config]
// Persist the loaded CSV data in the database.
def delete[:csv_base_relation] = csv_base_relation // Clear potential previous contents of base relation data.
def insert[:csv_base_relation] = csv
// Display the persisted data.
def output = ::std::display::table[csv_base_relation]
The CSV data are included as a string constant to make this example as self-contained as possible. To load CSV data from the cloud, see Cloud.
The Rel code above is executed as a write query.
It inserts the loaded CSV data into the database and persists them in the base relation csv_base_relation.
The command def delete[:csv_base_relation] = csv_base_relation ensures the new CSV data are inserted into a blank base relation and not appended to an existing one.
If this behavior is not wanted, you can remove this line.
The intermediate derived relation csv is not persisted in the database.
It is, however, still useful to first assign the loaded CSV data to an intermediate derived relation.
This is especially helpful when data cleaning or filtering steps should be performed before persisting the loaded CSV data in the database.
Import Scenarios
This section presents the most common CSV import use cases.
Replacing FilePos With Continuous Numbering
The examples so far have used the type FilePos as a key for each row.
Rel provides another option for loading the data: passing the load_csv output relation to the lined_csv utility.
For instance:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end
def csv = load_csv[config]
def rows_csv = lined_csv[csv]
def output = ::std::display::table[rows_csv]
The result now has continuous numbering, i.e., successive IDs as keys. This is not always guaranteed when using only load_csv. Note that lined_csv only counts lines that contain data, including erroneous ones, while headers and empty lines are not counted.
Using a Column as the Primary Key
The previous examples used the file row number as the key to identify each row.
This key is known as the primary key.
Instead, you can specify another data column as the primary key.
For example, consider the following data, where the first column id is meant to be the primary key:
id,cocktail
11,"martini"
22,"sazerac"
33,"cosmopolitan"
44,"bellini"
file: simple-import-2cols-with-key.csv
To use the id column as the key, you can proceed as follows:
First, read in the data in a similar way to the examples so far.
Then, convert the csv relation to another relation, cocktails, which replaces the FilePos row key with the values in the id column:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-2cols-with-key.csv"
def schema = {
(:id, "int");
(:cocktail, "string");
}
end
def csv = load_csv[config]
def cocktails(:cocktail, id, cocktail) =
csv(:id, pos, id) and csv(:cocktail, pos, cocktail)
from pos
def output = ::std::display::table[cocktails]
Please note that the key used should be unique to each row. It is good practice to enforce this uniqueness with an integrity constraint.
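For example, here is a minimal sketch of such a constraint, using the cocktails relation defined above and the stdlib function utility to require that each id maps to at most one cocktail name:

```rel
// Violated if some id is associated with more than one cocktail name.
ic cocktail_key_is_unique = {
    function(cocktails:cocktail)
}
```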
In general, the loaded CSV can be transformed in multiple ways before it is used by the rest of the model or saved as a base relation. This corresponds to an ELT (extract-load-transform) approach and can be more flexible than the traditional ETL (extract-transform-load), which transforms before the load.
Loading Multiple Files in the Same Relation
Certain cases call for loading data from several different files into the same relation. For example, you could have two files that correspond to sales over the period of two different months, or a dataset that was simply split into several files (as discussed in more detail in the following section). Here are two such example files showing cocktails and units sold in the months of March and April:
cocktail,units
martini,12
sazerac,23
file: cocktail-units-march.csv
cocktail,units
martini,12
bellini,32
file: cocktail-units-april.csv
To load the data in this case, you need to use load_csv twice, once for each file.
To distinguish the data by file, you can use the month name as an additional key in both the import configuration module config and the imported data csv:
// read query
def my_schema = {
(:cocktail, "string");
(:units, "int");
}
module config["march"]
def path = "azure://raidocs.blob.core.windows.net/csv-import/cocktail-units-march.csv"
def schema = my_schema
end
module config["april"]
def path = "azure://raidocs.blob.core.windows.net/csv-import/cocktail-units-april.csv"
def schema = my_schema
end
def csv[month] = load_csv[config[month]]
def output = ::std::display::table[csv]
Examining the output, observe that there is an additional string, either "march" or "april", identifying each dataset.
Multipart File Loads
In cases where a large data set is split up into different parts, you can use a file ID to distinguish them, and include it as part of the key, along with the file position.
The following example loads 5 files, country_part1.tsv.gz through country_part5.tsv.gz, into a single csv relation.
The config relation is parameterized by the index i, which is also included in the csv relation:
// read query
def pattern = "azure://raidocs.blob.core.windows.net/csv-import/countries/country_partNNN.tsv.gz"
def partid = range[1,5,1]
def files = string_replace[pattern, "NNN", string[x]] for x in partid
module config[i in partid]
def path = files[i]
def syntax:delim = '\t'
def syntax:header_row = -1
def syntax:header = (1, :ccode) ; (2, :cname)
def schema = (:ccode, "string") ; (:cname, "string")
end
def csv(col, i, pos, v) = load_csv[config[i]](col, pos, v)
def output:num_codes = count[code: csv:ccode(i, pos, code) from i, pos]
def output:num_pos = count[pos : csv:ccode(i, pos, code) from i, code]
def output:names(n) = csv:cname(_, _, n)
and (like_match("Ar\%", n) or like_match("Z\%", n) )
Note that while this code loads data for 250 countries, the number of file positions is smaller, since some rows start at the same file position in more than one input file.
Including i in the csv relation makes sure that those rows are distinguished from each other.
Import Errors
Sometimes, there are import errors that are not anticipated or that cannot be handled. In either case, load_csv supports error handling to help you understand what may have gone wrong.
When the CSV input file cannot be reliably read (e.g., it is formatted incorrectly), Rel attempts to parse as many rows as possible.
Any rows that it was not able to reliably parse are stored in the :load_errors sub-relation.
This sub-relation can be accessed in the same way as the rest of the sub-relations in the imported relation (e.g., csv:load_errors in the previous examples).
The :load_errors relation has the format (:load_errors, file_position, error_column, raw_line).
Note that it has arity 4, instead of the usual arity 3 of successfully imported column relations.
Examples are shown below.
Formatting Errors
Here is an example file that contains two formatting errors, namely, an extra column in each of the first two data lines:
cocktail,price
"sazerac",30,"extra"
"martini",24,"another_extra"
"cosmopolitan",20
file: import-errors-extracol.csv
In this example file, the first line is a header indicating that there should be two columns in the file. The second and third lines, however, have one extra field, which is an error. Finally, the last line is correct.
You can now attempt to import these data. Consider that the first column should be a string, while the second column should be an integer, and use the following code:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-extracol.csv"
def schema = {
(:cocktail, "string");
(:price, "int");
}
end
def csv = load_csv[config]
def output = csv
This output contains two load_errors entries, and displays the erroneous lines as the last field.
The third element of the :load_errors relation indicates the 1-based index of the field where Rel believes the error to be.
In this case, both errors are in the second field, since the schema you defined specifies two columns in the CSV file, but the two erroneous lines contain an extra third column.
Ensuring No Import Errors
Note that one easy way to ensure there are no loading errors is through the use of an integrity constraint.
Such a constraint could simply check that the :load_errors sub-relation within the csv relation is empty.
Here is the same example as before, with the addition of an integrity constraint, no_load_errors, which is violated when attempting to load problematic data:
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-extracol.csv"
def schema = {
(:cocktail, "string");
(:price, "int");
}
end
def csv = load_csv[config]
ic no_load_errors = {
empty(csv:load_errors)
}
Alternatively, the following integrity constraint not only reports when load errors have occurred, but also reports the offending lines in the CSV file:
ic no_load_errors(x...) = {
not csv:load_errors(_, x...)
}
You can add similar constraints to check for various kinds of data validation aspects. See Integrity Constraints for more details.
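For instance, here is a sketch of a simple data validation constraint in the same style as the ones above, hypothetically requiring every imported price to be positive:

```rel
// Violated if any imported price is zero or negative.
ic prices_are_positive(p) = {
    csv:price(_, p) implies p > 0
}
```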
Error in the Data Type
In the following example, there is no real error in the data, but the configuration specifies incorrect data types for each column: int for the cocktail name and string for the price. Here is the data:
cocktail,price
"sazerac",30
"martini",24
"cosmopolitan",20
file: import-errors-wrongtype.csv
Try to load the data as follows (note the incorrect data type definitions):
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-wrongtype.csv"
def schema = {
(:cocktail, "int");
(:price, "string");
}
end
def csv = load_csv[config]
def output = csv
The system reports three errors in the data.
All three errors are in the first column of the CSV file, which indicates that the cocktail names could not be converted to int.
Empty Lines
Rel omits empty lines in CSV files and proceeds to parse the subsequent lines containing data. For example, consider this CSV file that contains several blank lines:
cocktail,price
"sazerac",30
"martini",24
"cosmopolitan",20
file: import-errors-emptylines.csv
You can load these data using the following code:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-emptylines.csv"
def schema = {
(:cocktail, "string");
(:price, "int");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
The result shows that the empty lines were not considered errors, and the entire file was successfully loaded into the relation csv.
Missing Values
In datasets, some rows may have one or more missing values.
Here’s an example that loads data with the price value missing from the third and fifth rows:
// read query
def mydata = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,,"2020-04-04"
"""
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64,2)");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
The fields with the missing values were skipped and not imported; note, however, that there are no :load_errors in the result.
The next example loads the same dataset as before, but represents missing values with the string NA:
// read query
def mydata = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64,2)");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
The result is the same as before, plus two load_errors entries.
This is because the configuration specified that column 3 (which is missing from two rows) is a decimal(64,2).
As the “NA” strings cannot be parsed as decimal(64,2), the system considers these rows to contain errors.
As in the previous examples, Rel made a best-effort attempt and imported whatever was possible.
The following code loads the same file, this time specifying that the third column has type string:
// read query
def mydata = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "string");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
There are no errors now.
Such files can thus be loaded using "string" as the type of the column with missing values; the string values can then be cleaned, filtered, and converted to their intended type using Rel after the load.
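Alternatively, the syntax:missingstrings option listed under Syntax Options should let the parser treat "NA" as a missing value, so the price column can keep its intended decimal type. Here is a sketch, reusing the mydata string defined above:

```rel
// Sketch: treat the string "NA" as a missing value instead of a parse error.
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64,2)");
(:date, "date");
}
def syntax:missingstrings = "NA"
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
```

With this configuration, the NA fields would be skipped like empty fields, rather than reported as load errors.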
Error in Delimiter
In the following example, the data use a different delimiter (|) than the one specified in the syntax option (;):
// read query
def mydata = """
cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20
"""
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:price, "int");
}
def syntax:delim = ';'
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
Tips
Reusing Configurations
If you need to perform CSV imports that share a common configuration with slight variations, you can reuse a partial configuration by extending or overriding it.
The following example uses config_template to define config1 and config2, changing the type of column b with the help of the right_override (++>) stdlib utility:
// read query
module config_template
def data = """
a,b
2,4
3,9
"""
def schema:a = "int"
def schema:b = "int"
end
def config1 = config_template ++> (:schema, :b, "float")
def config2 = config_template ++> (:schema, :b, "string")
def output:one = load_csv[config1]
def output:two = load_csv[config2]
See Also
See the CSV Export guide to learn how to export CSV data. For more details on how to interact with data, see the rest of the Data Import and Export guides.
This guide complements the Working With CSV Data concept guide, where you can find all the relevant information about how to work with CSV data in Rel, including querying, data analysis, and manipulation.