Skip to content
  • REL
  • HOW-TO GUIDES
  • Data I/O: CSV Import

Data I/O: CSV Import

This how-to guide demonstrates how to import data from a CSV file using Rel.

Goal

By following this guide, you will be able to import CSV data into the RKGMS database and learn how to modify the import configurations to handle any type of CSV format and data location.

Introduction

The Comma Separated Values (CSV) data format is one of the most common, and CSV data is ubiquitous. This how-to guide will demonstrate how to import a wide range of CSV-like data with Rel, and describe all of the CSV import configuration options.

Simple Import

This guide will begin with a simple example. For a more detailed one, see the Comprehensive Example below. Consider the following CSV file with two string columns available in this Azure container:

country, drink
Spain,Gazpacho
Spain,Sangría
Argentina,"Yerba Mate"
United States,"Coca-Cola"
Italy,San Pellegrino

The following Rel code loads this file into a relation called my_data, and displays it using the table utility:

// query
 
def my_data = load_csv["azure://raidocs.blob.core.windows.net/csv-import/simple-drinks.csv"]
 
def output = table[my_data]

The imported CSV data are stored in the my_data relation, which is a Rel module. Each column in the original CSV is assigned to a subrelation, which is given the same name as the column header from the CSV. File line numbers are used as the key to connect row entries. For more details, see the CSV Data Model section below.

The table utility tells the rendering tool to display the imported data as a table.

By default, all columns are loaded as strings. There is no automatic type conversion, so ...,123,... becomes the string "123".

Defining Import Options

To specify data types for each column, include them in a config relation, which is a module that defines the import configuration. See the Modules concept guide to learn more about modules in Rel.

🔎

Note that config is an arbitrary module name and will be used throughout this guide, but you can give it any other name.

Consider this data:

cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"

config:schema specifies data types for each column:

// update
 
// import configuration
module config
    def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
 
    def schema = {
        (:cocktail, "string");
        (:quantity, "int");
        (:price, "decimal(64, 2)");
        (:date, "date");
    }
end
 
// import csv data as a base relation
def insert:my_data = load_csv[config]
 
// display imported data
def output = table[my_data]
🔎

The code above stores the imported CSV data in the base relation my_data.

You will usually want to store imported data as base relations, and then define views as derived relations based on the base relation data. See Derived and Base Relations for more details.

The statement def insert:my_data = load_csv[config] saves the imported data in the base relation my_data. To add data to a new or existing base relation, you can use the reserved insert relation. Note that the import is done within an update transaction, i.e., a non-read-only query, and the config definition is only present for the duration of the query. This corresponds to an update cell in the RAI notebooks. For more details on working with base relations, see the Updating Data: Working with Base Relations concept guide.

Querying CSV Data

Now you can query the imported CSV data by joining the columns of interest by their common key, i.e., the file position:

// query
 
def output(c, d) =
    my_data:cocktail(pos, c) and my_data:date(pos, d)
    from pos

The section CSV Data Model describes in more detail how the CSV data is stored in our RKGMS.

Import Options

For very simple situations, where no configuration needs to be specified, just provide the file path:

def data = load_csv["azure://...."]

For more customized imports, load_csv takes a configuration relation as an argument.

module config
    // ...
end
def data = load_csv[config]
🔎

The configuration relation is usually named config but any other name can be used.

A good practice is to make the configuration definition transient in an update query, so it will not affect other data loads.

However, you can also install load configurations. A common use case for this is defining template configurations. See section Reusing Configurations for more details.

Within the load configuration, you can specify:

OptionDescription
pathA URL for the CSV data.
integrationCredentials needed to access the data.
dataAn alternative way to specify the data itself.
schemaSchema of the CSV data.
syntaxSyntax of the CSV data.
⚠️

These configurations must be specified in a named module and cannot be supplied directly in the argument of load_csv. In other words, patterns like the following will result in an error:

load_csv[{
    (:path, "azure://.../data.csv");
    (:syntax, :delim, '\t')
}]

Path

The import option path is a string that specifies the location and name of the file you want to import. Currently, this string can point to azure://... (Microsoft Azure) or s3://... (Amazon S3) URLs. Refer to Accessing the Cloud for further details.

See Accessing the Cloud for further details on which storage cloud providers are supported. More information about what file types are supported can be found in Supported Data Files.

Example:

module config
    def path = "azure://myaccount.blob.core.windows.net/container/file.csv"
end

Integration

You may need to specify credentials for accessing your data in cloud storage, using the import option integration. See Accessing the Cloud for further details.

Data

The import option is a string specifying the actual data to be imported. If path is also present, the data field is ignored and only path is used for importing the data.

Example:

module config
    def data = "a,b,c\n1,2,3\n4,5,6"
end

You can also use the multi-line string functionality to write CSV data with several rows.

Schema

The import option schema is a binary relation indicating the file schema. The pairs are of the form (RelName, String). The column name is specified as RelName and the data type as String (e.g.: (:a, ,"int")).

The supported data types are "int", "string", "float","decimal(n, digits)", "date", and "datetime". For decimals, n is the number of bits and digits is the number of decimal digits, which should correspond to a valid Fixed Decimal Rel Type. See the Date Rel Type and Datetime Rel Type docs for more about the supported formats. The default behavior is that everything is loaded as a string. Note that the type definitions are case sensitive.

Example:

module config
    def schema = {
        (:cocktail, "string");
        (:quantity, "int");
        (:price, "decimal(64, 2)");
        (:date, "date");
    }
end

Syntax

The import option syntax is a module with sub-fields and is used to specify the CSV file format. The module syntax has the following sub-fields as options:

OptionDescription
syntax:headerA relation with (Int, RelName) pairs, specifying the symbols to be used as column names, where the Int indices indicate the column order. This option overrides the default (or existing) header names.
syntax:header_rowAn Int that specifies the row from where to start importing the file. The column names are specified by the content of that row, if they are not defined using the header option above. The values -1 and 0 indicate that no header is present and the system creates column names for you. See Defining a Header Row. If no header_row value is specified, the value 1 is used as default (that is, the first line of the file is used).
syntax:delimThe Char delimiter used in the file to separate the columns. The default delimiter is a comma ,.
syntax:quotecharA Char that indicates a quoted text field. A quoted text field is required to contain special characters like the delimiter or a newline character in a text. The default quote character is ".
syntax:escapecharThe Char used to escape the quote characters within a quoted (for example, text) field. The default value is a backslash \.

Example:

module config
    module syntax
        def delim = '|'
        def header_row = -1
    end
end
🔎

Schema and Syntax options are applied to all type of data regardless their source (cloud, local, string literal,…)

CSV Data Model

In Rel, data is best stored in Graph Normal Form (GNF), a fully normalized representation that takes advantage of Rel features such as modularity and specialization.

CSV data imported via load_csv follows basic GNF conventions. Concretely, the CSV data is represented as triplets of the form (colname, pos, v):

def data(colname, pos, v) = load_csv[config](colname, pos, v)

Note that:

  • The variable colname refers to the CSV column names and is of type RelName.
  • The variable pos is the record identifier and has the type FilePos. It uniquely identifies the row number of a given CSV file.
  • The variable v is the actual data.

By storing the CSV data in this format, the relation data becomes a Rel module. To understand how modules work, see the Modules concept guide.

The Rel code that defines data above is written in the point-wise form, where all variables are explicitly stated. A more compact way of writing the above query is by using the point-free style, where the variables are implied:

def data = load_csv[config]

Once imported, the raw CSV data will usually be schema-mapped and transformed further, according to the desired data model. GNF is recommended for cleaner modeling and efficient reasoning.

Specifying Data Sources

Cloud

You can import CSV files from the cloud using load_csv[config]. Refer to Accessing the Cloud for further details.

Local Files

To import CSV files, you need to use the RAI Console. In the Console, you can either:

  • Upload data directly to the database using the “Data” section in the object tree for the given DB.
  • Upload data using the RAI notebook file upload mechanism.

In both cases, the uploaded file is stored as a string in a base relation, say, my_uploaded_file. To process the file as CSV data, use the load_csv functionality with config:data pointing to that base relation:

def config:data(text) = my_uploaded_file(text) and String(text)
def my_data = load_csv[config]

The condition String(text) ensures that only string values get assigned to config:data. See the next section, Strings, for details on how to specify CSV data directly as a string.

Using the schema option, you can specify the data type of each of the imported columns:

module config
    def data = my_uploaded_file
 
    def schema = {
        (:cocktail, "string");
        (:quantity, "int");
        (:price, "decimal(64, 2)");
        (:date, "date");
    }
end
 
// import csv data as a base relation
def insert:my_data = load_csv[config]

Strings

CSV data can be specified directly as a string constant. This is handy for small data sets and tests.

Here is a small example:

// query
 
def config:data = """
a|b|c
1|2|3
4|5|6
"""
def config:syntax:delim = '|'
 
def output = load_csv[config]

Compression

You can import compressed CSV files. Refer to Supported Data Files for further details.

You can import files up to 64MB in size. If you try to import a larger file, you will receive a validation error. This limit applies to local files only; no limit is imposed for imports from cloud sources.

Specifying Schema Options

Column Data Types

Consider the following data file with four columns, each with a different data type: string, integer, float, and date.

cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"

file: simple-import-4cols.csv

Using the schema option, you can import this file and specify the data type of each of the imported columns:

// query
 
module config
    def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
 
    def schema = {
        (:cocktail, "string");
        (:quantity, "int");
        (:price, "decimal(64, 2)");
        (:date, "date");
    }
end
 
def csv = load_csv[config]
def output = csv

This confirms that the data in each column has the data type that was specified above.

Note that quotes around individual values are stripped; to escape quotes, the default is \". Any spaces before or after a comma are preserved (as is standard with CSVs), with the exception of the header row.

Column Names

The previous examples imported the file using the default column names, as defined in the first row of the file (this row number can be specified with the header_row parameter). Rel also lets you specify any column names you would like. Once you decide to specify the header, you must specify all the column names. You’re unable to skip any of them.

The next example loads the same data, but changes the imported column names: it renames cocktail to beverage, quantity to amount, and price to value. (The name date is not changed). This is done using the syntax and header parameters within the config relation. The header parameter takes as input tuples of the form (col_pos, col_name), which indicate how to rename each of the columns. col_pos refers to the column position as it appears in the imported file.

// query
 
module config
    def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
 
    def syntax:header = {
        (1, :beverage); (2, :amount);
        (3, :value); (4, :date)
    }
 
    def schema = {
        (:beverage, "string");
        (:amount, "int");
        (:value, "decimal(64,2)");
        (:date, "date");
    }
end
 
def csv = load_csv[config]
def output = csv

To indicate that there is no header in the file, set the parameter syntax:header_row to 0 or -1:

// query
 
module config
    def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
 
    def syntax:header_row = -1
end
 
def csv = load_csv[config]
def output = csv

In this case, the system assigned default COLX names to the columns. Additionally, it considered the first row (which actually contains the column headers) as actual data and imported that data as well.

Since no types were specified, all data were read in as strings. You can specify the schema using the COLX attributes. For example for the simple file:

"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"

file: simple-import-4cols-noheader.csv

you can specify the types of the columns as follows:

// query
 
module config
    def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols-noheader.csv"
 
    def syntax:header_row = -1
 
    def schema = {
        (:COL1, "string");
        (:COL2, "int");
        (:COL3, "decimal(64,2)");
        (:COL4, "date");
    }
end
 
def csv = load_csv[config]
def output = csv

The data is now imported with their correct types: string, int, decimal, and date.

Specifying Syntax Options

Defining a Header Row

Going back to the data with the headers, you can also use the second row to define the column names.

Specifying header_row = 2 tells the system where to start reading within the file. The first row, i.e., cocktail,quantity,price,date, is ignored. The system starts at line 2, considers that line as a header, and then loads the rest of the data.

In general, all rows above the header_row are ignored. This option is useful when a number of lines at the beginning of the file need to be ignored. For example, if there are comments before the actual header and data that describe what the file is about. By specifying the right header_row, you can load the header and subsequent data without having to edit the file to remove any comments at the top.

In our example, this generates clearly incorrect results, since row 2 is not, in fact, a header file:

// query
 
module config
    def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
 
    def syntax:header_row = 2
end
 
def csv = load_csv[config]
def output = table[csv]

In this case, the correct solution is to not define the header_row option within config, since the imported CSV file starts with the header row. This is equivalent to def syntax:header_row = 1.

Modifying Delimiters

You can change the delimiters used to read the data. Consider the following file, which uses | as a delimiter:

cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20

file: simple-import-1col-pipedelim.csv

You can read such data by specifying the delim parameter in the config relation:

// query
 
module config
    def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-1col-pipedelim.csv"
 
    def schema = {
        (:cocktail, "string");
        (:price, "int");
    }
 
    def syntax:delim = '|'
end
 
def csv = load_csv[config]
def output = csv

Loading TSV Files

For the common case of loading a tab-separated file (TSV), set the delimiter to \t:

def config:syntax:delim = '\t'

The other config options will work in the same way as discussed in this document for CSVs.

Comprehensive Example

Here is a more comprehensive import example serving as a template that can be copy-and-pasted. It touches on multiple CSV loading aspects users will encounter when working with CSV data.

// update
 
def mydata = """
first,last,age,plays_for
Leonel,Messi,32,PSG
Cristiano,Ronaldo,36,"Manchester United"
"""
 
// import configuration
module config
    def data = mydata
 
    module schema
        def first = "string"
        def last = "string"
        def age = "int"
        def plays_for = "string"
    end
 
    module syntax
      def header_row = 1
      def delim = ','
 
      // to change the header names:
      // def header = {(1, :firstname); (2, :lastname); (3, :age); (4, :team)}
    end
end
 
// intermediate derived relation
def csv = load_csv[config]
 
// persist loaded CSV data in the database
def delete[:csv_base_relation] = csv_base_relation // clear potential previous contents of base relation data
def insert[:csv_base_relation] = csv
 
// display the persisted data
def output = table[csv_base_relation]

The CSV data is included as a string constant to make this example as self-contained as possible. To load CSV data from the cloud, please see the Cloud section.

The Rel code above is executed as an Update transaction needed to insert the loaded CSV data into the database and persist it in the base relation csv_base-relation. The command def delete[:csv_base_relation] = csv_base-relation ensures the new CSV data is inserted into a blank base relation and not appended to an existing one. If this behavior is not wanted, you can remove this line.

The intermediate derived relation csv is not persisted in the database. It is, however, still useful to first assign the loaded CSV data to an intermediate derived relation. This is especially helpful when data cleaning or filtering steps should be performed before persisting the loaded CSV data in the database.

Advanced Import Scenarios

Replacing FilePos With Continuous Numbering

The examples so far have used the type FilePos as a key for each row. Rel provides another option for loading the data: passing the load_csv output relation to the lined_csv utility, as follows:

// query
 
def config[:path] = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64, 2)"
def config[:schema, :date] = "date"
 
def csv = load_csv[config]
def rows_csv = lined_csv[csv]
def output = rows_csv

The result now has continuous numbering, i.e., successive IDs as keys. This is not always guaranteed when using only load_csv. Note that lined_csv only counts lines that contain data, including errors, while headers and empty lines are not counted.

Using a Column as the Primary Key

The previous examples used the file row number as the key to identify each row. This key is known as the primary key. Instead, you can specify another data column as the primary key. For example, consider the following data, where the first column id is meant to be the primary key:

id,cocktail
11,"martini"
22,"sazerac"
33,"cosmopolitan"
44,"bellini"

file: simple-import-2cols-with-key.csv

To use the id column as the key, you can proceed as follows: First, read in the data in a similar way to the examples so far. Then, convert the csv relation to another relation, cocktails, which replaces the FilePos row key with the values in the id column:

// query
 
module config
    def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-2cols-with-key.csv"
 
    def schema = {
        (:id, "int");
        (:cocktail, "string");
    }
end
 
def csv = load_csv[config]
 
def cocktails(:cocktail, id, cocktail) =
    csv(:id, pos, id) and csv(:cocktail, pos, cocktail)
    from pos
 
def output = cocktails

Please note that the key used should be unique to each row. It is good practice to ensure that this key is unique using an integrity constraint.

In general, the loaded CSV can be transformed in multiple ways before it is used by the rest of the model or saved as a base relation. This corresponds to an ELT (extract-load-transform) approach and can be more flexible than the traditional ETL (extract-transform-load), which transforms before the load.

Loading Multiple Files in the Same Relation

Certain cases call for loading data from several different files into the same relation. For example, you could have two files that correspond to sales over the period of two different months, or a dataset that was simply split into several files (as discussed in more detail in the following section). Here are two such example files showing cocktails and units sold in the months of March and April:

cocktail,units
martini,12
sazerac,23

file: cocktail-units-march.csv

cocktail,units
martini,12
bellini,32

file: cocktail-units-april.csv

To load the data in this case, you need to use load_csv twice, one for each file. In order to distinguish data by file, you can use the month name as an additional key for the import configuration module config and the imported data csv:

// query
 
def my_schema = {
    (:cocktail, "string");
    (:units, "int");
}
 
module config["march"]
    def path = "azure://raidocs.blob.core.windows.net/csv-import/cocktail-units-march.csv"
    def schema = my_schema
end
 
module config["april"]
    def path = "azure://raidocs.blob.core.windows.net/csv-import/cocktail-units-april.csv"
    def schema = my_schema
end
 
def csv[month] = load_csv[config[month]]
 
def output = csv

Examining the output, observe that you have an additional string, either "march" or "april" identifying each dataset.

Multipart File Loads

In cases where a large data set is split up into different parts, you can use a file ID to distinguish them, and include it as part of the key, along with the file position.

The following example loads 5 files, country_part1.gz through country_part5.gz, into a single csv relation. The config relation is parameterized by the index i, which is also included in the csv relation:

// query
 
def pattern = "azure://raidocs.blob.core.windows.net/csv-import/countries/country_partNNN.tsv.gz"
def partid = range[1,5,1]
def files = string_replace[pattern, "NNN", string[x]] for x in partid
 
module config[i in partid]
    def path = files[i]
    def syntax:delim = '\t'
    def syntax:header_row = -1
    def syntax:header = (1, :ccode) ; (2, :cname)
    def schema = (:ccode, "string") ; (:cname, "string")
end
 
def csv(col, i, pos, v) = load_csv[config[i]](col, pos, v)
 
def output:num_codes = count[code: csv:ccode(i, pos, code) from i, pos]
def output:num_pos = count[pos : csv:ccode(i, pos, code) from i, code]
def output:names(n) = csv:cname(_, _, n)
    and (like_match("Ar\%", n) or like_match("Z\%", n) )

Note that while this code loads data for 250 countries, the number of file positions is smaller, since some rows start at the same file position in more than one input file. Including i in the csv relation makes sure that those rows are distinguished from each other.

Import Errors

Sometimes, there are import errors that are not anticipated or which cannot be handled. In either case, load_csv supports error handling to help you understand what may have gone wrong.

When the CSV input file cannot be reliably read (e.g., it is formatted incorrectly), Rel attempts to parse as many rows as possible. Any rows that it was not able to reliably parse are stored in the :load_errors sub-relation. This sub-relation can be accessed the same way as the rest of the sub-relations in the imported relation (e.g. csv:load_errors in the previous examples).

The :load_errors relation has the format (:load_errors, file_position, error_column, raw_line). Note that it has arity 4, instead of the normal arity 3 of successfully imported column relations. Examples are shown below.

Formatting Errors

Here is an example file that contains two formatting errors, namely, an extra column per line for the first two lines:

cocktail,price
"sazerac",30,"extra"
"martini",24,"another_extra"
"cosmopolitan",20

file: import-errors-extracol.csv

In this example file, the first line is a header indicating that there should be two columns in the file. The second and third lines, however, have one extra field, which is an error. Finally, the last line is correct.

You can now attempt to import this data. Consider that the first column should be a string, while the second column should be an integer, and use the following code:

// query
 
def config[:path] = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-extracol.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"
 
def csv = load_csv[config]
def output = csv

This output specifies two load_errors, and displays the erroneous lines as the last field. The third element of the :load_errors relation indicates the 1-based index of the field where Rel believes the error to be. In this case, both errors are in the second field since the schema you defined specifies two columns in the CSV file, but the two erroneous lines contain an extra third column.

Ensuring No Import Errors

Note that one easy way to ensure no loading errors is through the use of an integrity constraint. Such a constraint could just check if the :load_errors sub-relation within the csv relation is empty.

Here is the same example as before with the addition of an integrity constraint, no_load_errors, which is violated when attempting to load problematic data:

def config[:path] = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-extracol.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"
 
def csv = load_csv[config]
 
ic no_load_errors = {
    empty(csv:load_errors)
}

Alternatively, the following integrity constraint not only reports when load errors have occurred, but also reports the offending lines in the CSV file:

ic no_load_errors(x...) = {
    not csv:load_errors(_, x...)
}

You can add similar constraints to check for various kinds of data validation aspects. Refer to the Integrity Constrains concept guide for more information.

Error in the Data Type

In the following example, there is no real error in data, but the configuration specifies incorrect data types for each column: int for the cocktail name and string for the price. Here is the data:

cocktail,price
"sazerac",30
"martini",24
"cosmopolitan",20

file: import-errors-wrongtype.csv

Try to load the data as follows (note the incorrect data type definitions):

// query
 
def config[:path] = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-wrongtype.csv"
def config[:schema, :cocktail] = "int"
def config[:schema, :price] = "string"
 
def csv = load_csv[config]
def output = csv

The system reports three errors in the data. All three errors are in the first column of the CSV file, which indicates that the cocktail names could not be converted to int.

Empty Lines

Rel ignores empty lines in CSV files and continues parsing in the next line with data. For example, consider the following CSV file that contains several blank lines:

cocktail,price
"sazerac",30


"martini",24


"cosmopolitan",20

file: import-errors-emptylines.csv

You can read this data using the following code:

// query
 
def config[:path] = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-emptylines.csv"
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"
 
def csv = load_csv[config]
def output = csv

The result shows that the empty lines were not considered errors, and the entire file was successfully loaded into the relation csv.

Missing Values

Datasets might simply not contain one or more values per row. Here is an example data load where the price value is missing from the third and fifth rows:

// query
 
def config[:data] = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,,"2020-04-04"
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64,2)"
def config[:schema, :date] = "date"
 
def csv = load_csv[config]
def output = csv

The fields with the missing values were skipped and not imported, but note that there are no :load_errors in the result.

The next example loads the same dataset as before, but representing missing values with the string NA:

// query
 
def config[:data] = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "decimal(64,2)"
def config[:schema, :date] = "date"
 
def csv = load_csv[config]
def output = csv

The result is the same as before, plus two load_errors. This is because the configuration specified that column 3 (which is missing from two rows) is a decimal(64,2). As the “NA” strings cannot be parsed as decimal(64,2), the system considers that there was an error in these rows. Similarly to the previous examples, Rel did a best-effort approach and imported whatever was possible.

The following code loads the same file, specifying instead that the third column has type “string”:

// query
 
def config[:data] = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :quantity] = "int"
def config[:schema, :price] = "string"
def config[:schema, :date] = "date"
 
def csv = load_csv[config]
def output = csv

There are no errors now.

Such files can thus be loaded using “string” as the type of the missing column; the string values can be cleaned, filtered, and converted to their intended type using Rel, after the load.

Error in Delimiter

In the following example, the data use a different delimiter (|) than the one you specify in the syntax option (;):

// query
 
def config[:data] = """
cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20
"""
def config[:schema, :cocktail] = "string"
def config[:schema, :price] = "int"
 
def config[:syntax, :delim] = ';'
 
def csv = load_csv[config]
def output = csv

Tips

Reusing Configurations

If you need to perform CSV imports that have a common configuration but some slight variations, you can reuse a partial configuration by extending or overriding it.

The following example uses config_template to define config1 and config2, changing the type of column b with the help of the right_override stdlib utility:

// query
 
module config_template
    def data = """
    a,b
    2,4
    3,9
    """
 
    def schema:a = "int"
    def schema:b = "int"
end
 
def config1 = config_template ++> (:schema, :b, "float")
def config2 = config_template ++> (:schema, :b, "string")
 
def output:one = load_csv[config1]
def output:two = load_csv[config2]

Summary

You have learned how to import CSV data using Rel, in particular using the load_csv functionality. By following the examples, you are now able to import various forms of CSV data into your RKGMS database, including string constants, local CSV files, and files in cloud storage. You have learned about data import errors, and are able to follow the steps if you run into any loading errors. For related reading, see the CSV Export guide.

Was this doc helpful?