CSV Import
This how-to guide demonstrates how to import CSV data using Rel.
Goal
By following this guide, you will be able to import CSV data into the RKGS database and modify the import configuration to handle a wide range of CSV formats and data locations.
Introduction
The Comma Separated Values (CSV) data format is one of the most common, and CSV data is ubiquitous. This how-to guide demonstrates how to import a wide range of CSV-like data with Rel, and describes all of the CSV import configuration options.
The load_csv Relation
The main way to load CSV data into the system is through the built-in Rel relation load_csv.
def config = ...
def my_csv = load_csv[config]
The relation config is used to define all CSV load configurations.
For details on all the available options, see Import Options below.
Getting Started
This guide begins with a simple example; for a more detailed one, see the Comprehensive Example below. Consider the following CSV file with two string columns, available in this Azure container:
country, drink
Spain,Gazpacho
Spain,Sangría
Argentina,"Yerba Mate"
United States,"Coca-Cola"
Italy,San Pellegrino
The following Rel code loads this file into a relation called my_data, and displays it using the table utility:
// read query
def my_data = load_csv["azure://raidocs.blob.core.windows.net/csv-import/simple-drinks.csv"]
def output = ::std::display::table[my_data]
The imported CSV data are stored in the my_data relation, which is a Rel module.
Each column in the original CSV is assigned to a subrelation, which is given the same name as the column header from the CSV.
File line numbers are used as the key to connect row entries.
For more details, see CSV Data Schema in the Working With CSV Data concept guide.
The table utility tells the rendering tool to display the imported data as a table.
By default, all columns are loaded as strings.
There is no automatic type conversion, so ...,123,... becomes the string "123".
Defining Import Options
To specify data types for each column, include them in a config relation, which is a module that defines the import configuration.
See the Modules concept guide to learn more about modules in Rel.
Note that config is an arbitrary module name used throughout this guide; you can give it any other name.
Consider this data:
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"
config:schema specifies data types for each column:
// write query
// Set the import configuration.
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end
// Import CSV data as a base relation.
def insert:my_data = load_csv[config]
// Display the imported data.
def output = ::std::display::table[my_data]
The code above stores the imported CSV data in the base relation my_data.
You will usually want to store imported data as base relations, and then define views as derived relations based on the base relation data. See Working With Base Relations for more details.
The statement def insert:my_data = load_csv[config] saves the imported data in the base relation my_data.
To add data to a new or existing base relation, you can use the control relation insert.
Note that the import is done within a write query, and the config definition is only present for the duration of the query.
For more details, see Working with Base Relations.
Querying CSV Data
Now you can query the imported CSV data by joining the columns of interest by their common key, i.e., the file position:
// read query
def output(c, d) =
my_data:cocktail(pos, c) and my_data:date(pos, d)
from pos
See CSV Data Schema for more details on how CSV data are stored in the RelationalAI RKGS.
Import Options
For very simple situations, where no configuration needs to be specified, just provide the file path:
def data = load_csv["azure://...."]
For more customized imports, load_csv takes a configuration relation as an argument.
module config
// ...
end
def data = load_csv[config]
The configuration relation is usually named config, but any other name can be used.
A good practice is to make the configuration definition transient using a read query, so it will not affect other data loads.
However, you can also install load configurations. A common use case for this is defining template configurations. See section Reusing Configurations for more details.
Within the load configuration, you can specify:
Option | Description |
---|---|
path | A URL for the CSV data. |
data | An alternative way to specify the data itself. |
integration | Credentials needed to access the data. |
schema | Schema of the CSV data. |
syntax | Syntax of the CSV data. |
These configurations must be specified in a named module and cannot be supplied directly in the argument of load_csv. In other words, patterns like the following will result in an error:
load_csv[{
(:path, "azure://.../data.csv");
(:syntax, :delim, '\t')
}]
Path
The import option path is a string that specifies the location and name of the file you want to import.
Currently, this string can point to azure://... (Microsoft Azure) or s3://... (Amazon S3) URLs.
See Accessing the Cloud for further details on the supported cloud storage providers, and Supported Data Files for the supported file types.
Example:
module config
def path = "azure://myaccount.blob.core.windows.net/container/file.csv"
end
Data
The import option data is a string specifying the actual data to be imported.
If path is also present, the data field is ignored and only path is used for importing the data.
Example:
module config
def data = "a,b,c\n1,2,3\n4,5,6"
end
You can also use the multi-line string functionality to write CSV data with several rows.
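For instance, here is a small sketch (with hypothetical column names a, b, and c) that supplies several rows of CSV data as a multi-line string:

```rel
// Sketch: CSV data given inline as a multi-line string.
module config
def data = """
a,b,c
1,2,3
4,5,6
"""
end
def my_data = load_csv[config]
```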
Integration
You may need to specify credentials to access your data in cloud storage, using the import option integration.
See the Accessing the Cloud guide for more details.
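As a rough sketch, an integration configuration for a private Azure container might look like the following; the provider and credential field names shown here are assumptions, and the exact names are documented in Accessing the Cloud:

```rel
module config
def path = "azure://myaccount.blob.core.windows.net/container/file.csv"
// Hypothetical credential fields -- see Accessing the Cloud for the exact names.
def integration:provider = "azure"
def integration:credentials:azure_sas_token = "<your-sas-token>"
end
```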
Schema
The import option schema is a binary relation indicating the file schema.
The pairs are of the form (RelName, String): the column name is specified as a Symbol (of type RelName) and the data type as a String (e.g., (:a, "int")).
The supported data types are "int", "string", "float", "decimal(n, digits)", "date", "datetime", and "bool".
For decimals, n is the number of bits and digits is the number of decimal digits, which should correspond to a valid Fixed Decimal Rel type. See the Date Rel Type and Datetime Rel Type docs for more about the supported formats.
For booleans, you can use one of the strings "true", "True", "1", or "T" to represent true, and one of "false", "False", "0", or "F" to represent false.
The default behavior is that everything is loaded as a string.
Note that the type definitions are case sensitive.
Example:
module config
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
(:happyhour, "bool");
}
end
Syntax
The import option syntax is a module with sub-fields, used to specify the CSV file format.
The module syntax has the following sub-fields as options:
Option | Description |
---|---|
syntax:header | A relation with (Int, RelName) pairs, specifying the symbols to be used as column names, where the Int indices indicate the column order. This option overrides the default (or existing) header names. |
syntax:header_row | An Int that specifies the row from where to start importing the file. The column names are specified by the content of that row, if they are not defined using the header option above. The values -1 and 0 indicate that no header is present and the system creates column names for you. See Defining a Header Row. If no header_row value is specified, the value 1 is used as default (that is, the first line of the file is used). |
syntax:datarow | An Int that specifies the row from where to begin parsing values. |
syntax:missingstrings | A String, or multiple instances of String, that should be interpreted as missing values. By default, only empty fields are considered missing. |
syntax:delim | The ASCII Char delimiter used in the file to separate the columns. The default delimiter is a comma , . |
syntax:quotechar | An ASCII Char that indicates a quoted text field. Quoting is required when a text field contains special characters such as the delimiter or a newline character. The default quote character is " . |
syntax:escapechar | An ASCII Char used to escape the quote character within a quoted text field. The default value is a backslash \ . |
syntax:decimalchar | An ASCII Char to be used when parsing float values with decimal parts. The default is . . |
syntax:groupmark | An ASCII Char denoting the number grouping mark. This is useful for parsing numbers with separators (for example, for thousands separators like 1,000.00, you can set groupmark=','). By default, group marks are not permitted in numbers. |
Example:
module config
module syntax
def delim = '|'
def header_row = -1
end
end
The schema and syntax options are applied to all types of data, regardless of their source (cloud, local, string literal, etc.).
CSV Data Model
In Rel, data is best stored in Graph Normal Form (GNF), a fully normalized representation that takes advantage of Rel features such as modularity and specialization.
CSV data imported via load_csv follow basic GNF conventions.
Concretely, the CSV data are represented as triples of the form (colname, filepos, value).
// read query
def output = load_csv["azure://raidocs.blob.core.windows.net/csv-import/simple-drinks.csv"]
For more details, see CSV Data Schema in the Working With CSV Data concept guide.
Data Sources
This section covers the available sources from which you can import CSV data.
Strings
You can directly specify CSV data as a string. This is handy for small datasets and tests.
Here’s an example:
// read query
def mydata = """
a|b|c
1|2|3
4|5|6
"""
module config
def data = mydata
def syntax:delim = '|'
end
def output = ::std::display::table[load_csv[config]]
Local Files
You can load local CSV data using the importing functionality from the RAI Console.
The uploaded file is stored as a string in a base relation, say, my_uploaded_file.
To process the file as CSV data, use the load_csv functionality with config:data pointing to that base relation:
def config:data(text) = my_uploaded_file(text) and String(text)
def my_data = load_csv[config]
The condition String(text) ensures that only string values get assigned to config:data.
See Strings above for details on how to specify CSV data directly as a string.
Using the schema option, you can specify the data type of each of the imported columns:
module config
def data = my_uploaded_file
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end
// Import CSV data as a base relation.
def insert:my_data = load_csv[config]
You can import files up to 64MB in size. If you try to import a larger file, you will receive a validation error. This limit applies to local files only; no limit is imposed for imports from cloud sources.
Cloud
You can import public and private CSV data from the supported cloud providers.
Here’s an example that loads public data:
module config
def path = "s3://my-s3-bucket/myfile.csv"
end
def my_csv = load_csv[config]
See Accessing the Cloud for further details.
Compressed Files
You can import compressed CSV files from the cloud. No specific options are required, so you can just provide the file path with the compression extension. For example:
module config
def path = "azure://myaccount.blob.core.windows.net/sascontainer/myfile.csv.gz"
end
def my_data = load_csv[config]
Currently only gzip compression, with files ending in .gz, is supported.
See Compressed Data for more details.
Schema Options
Column Data Types
Consider the following data file with four columns, each with a different data type: string, integer, decimal, and date.
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"
file: simple-import-4cols.csv
Using the schema option, you can import this file and specify the data type of each of the imported columns:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
This confirms that the data in each column has the data type that was specified above.
Note that quotes around individual values are stripped; the default escape sequence for a quote character is \".
Any spaces before or after a comma are preserved (as is standard with CSVs), with the exception of the header row.
Column Names
The previous examples imported the file using the default column names, as defined in the first row of the file (this row number can be specified with the header_row parameter).
Rel also lets you specify any column names you would like.
Once you decide to specify the header, you must specify all the column names; you cannot skip any of them.
The next example loads the same data, but changes the imported column names:
it renames cocktail to beverage, quantity to amount, and price to value. (The name date is not changed.)
This is done using the syntax and header parameters within the config relation.
The header parameter takes as input tuples of the form (col_pos, col_name), which indicate how to rename each of the columns.
Here, col_pos refers to the column position as it appears in the imported file.
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:header = {
(1, :beverage); (2, :amount);
(3, :value); (4, :date)
}
def schema = {
(:beverage, "string");
(:amount, "int");
(:value, "decimal(64,2)");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
No Header Information
To indicate that there is no header in the file, set the parameter syntax:header_row to 0 or -1:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:header_row = -1
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
In this case, the system assigned default COLX names to the columns.
Additionally, it considered the first row (which actually contains the column headers) as actual data and imported that data as well.
Since no types were specified, all data were read in as strings. You can specify the schema using the COLX attributes. For example, for the following file without a header:
"martini",2,12.50,"2020-01-01"
"sazerac",4,14.25,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,12.25,"2020-04-04"
file: simple-import-4cols-noheader.csv
you can specify the types of the columns as follows:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols-noheader.csv"
def syntax:header_row = -1
def schema = {
(:COL1, "string");
(:COL2, "int");
(:COL3, "decimal(64,2)");
(:COL4, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
The data are now imported with their correct types: string, int, decimal, and date.
Syntax Options
Defining a Header Row
Going back to the data with the headers, you can also use the second row to define the column names.
Specifying header_row = 2 tells the system where to start reading within the file.
The first row, i.e., cocktail,quantity,price,date, is ignored.
The system starts at line 2, considers that line as a header, and then loads the rest of the data.
In general, all rows above the header_row are ignored.
This option is useful when a number of lines at the beginning of the file need to be ignored.
One example is if there are comments before the actual header and data that describe what the file is about.
By specifying the right header_row, you can load the header and subsequent data without having to edit the file to remove any comments at the top.
The example below clearly generates incorrect results, since row 2 is not, in fact, a header row:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:header_row = 2
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
In this case, the correct solution is to not define the header_row option within config, since the imported CSV file starts with the header row.
This is equivalent to def syntax:header_row = 1.
Modifying Delimiters
You can change the delimiters used to read the data. Consider the following file, which uses | as a delimiter:
cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20
file: simple-import-1col-pipedelim.csv
You can read such data by specifying the delim parameter in the config relation:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-1col-pipedelim.csv"
def schema = {
(:cocktail, "string");
(:price, "int");
}
def syntax:delim = '|'
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
Importing Numbers With Group Marks
In some cases, the data within a CSV file may contain numbers that have group marks, for example, 1,000,000.00.
You can import such numbers by specifying the schema of the CSV file, which lets the system know where to expect a number, together with the groupmark syntax option.
Here’s an example:
// read query
module config
def data = """
cocktail,sales
sazerac,1,000.00
daquiri,2,000.00
"""
def schema = {
(:cocktail, "string");
(:sales, "float");
}
def syntax:groupmark = ','
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
Note that removing the syntax:groupmark = ',' option causes the load to result in errors.
Omitting Top Rows
In some cases, the first few lines of a CSV file may contain comments or information that will not parse correctly.
You can omit loading a specific number of lines at the beginning of the file by using the syntax:datarow option as follows:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:datarow = 3
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
This omits the first two rows and loads the rest into the csv relation.
Note that since no header information was specified, the first row was used as a header by default.
To ensure that the first row doesn’t serve as a header, you also need to use the header_row = -1 option:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def syntax:header_row = -1
def syntax:datarow = 3
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
Here, the columns have generic names, such as COL1.
Loading TSV Files
For the common case of loading a tab-separated values (TSV) file, set the delimiter to \t:
module config
def syntax:delim = '\t'
end
The other config options apply to TSV files in the same way as to CSV files.
Comprehensive Example
Here’s a more comprehensive import example, which you can copy and paste. It touches on multiple CSV loading aspects users will encounter when working with CSV data.
// write query
def mydata = """
first,last,age,plays_for
Leonel,Messi,32,PSG
Cristiano,Ronaldo,36,"Manchester United"
"""
// Set the import configuration.
module config
def data = mydata
module schema
def first = "string"
def last = "string"
def age = "int"
def plays_for = "string"
end
module syntax
def header_row = 1
def delim = ','
// To change the header names:
// `def header = {(1, :firstname); (2, :lastname); (3, :age); (4, :team)}`.
end
end
// Define the intermediate derived relation.
def csv = load_csv[config]
// Persist the loaded CSV data in the database.
def delete[:csv_base_relation] = csv_base_relation // Clear potential previous contents of base relation data.
def insert[:csv_base_relation] = csv
// Display the persisted data.
def output = ::std::display::table[csv_base_relation]
The CSV data are included as a string constant to make this example as self-contained as possible. To load CSV data from the cloud, see Cloud.
The Rel code above is executed as a write query.
It inserts the loaded CSV data into the database and persists them in the base relation csv_base_relation.
The command def delete[:csv_base_relation] = csv_base_relation ensures the new CSV data are inserted into a blank base relation and not appended to an existing one.
If this behavior is not wanted, you can remove this line.
The intermediate derived relation csv is not persisted in the database.
It is, however, still useful to first assign the loaded CSV data to an intermediate derived relation.
This is especially helpful when data cleaning or filtering steps should be performed before persisting the loaded CSV data in the database.
Import Scenarios
This section presents the most common CSV import use cases.
Replacing FilePos With Continuous Numbering
The examples so far have used the type FilePos as a key for each row.
Rel provides another option for loading the data: passing the load_csv output relation to the lined_csv utility.
For instance:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-4cols.csv"
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64, 2)");
(:date, "date");
}
end
def csv = load_csv[config]
def rows_csv = lined_csv[csv]
def output = ::std::display::table[rows_csv]
The result now has continuous numbering, i.e., successive IDs as keys. This is not always guaranteed when using only load_csv. Note that lined_csv only counts lines that contain data, including erroneous ones, while headers and empty lines are not counted.
Using a Column as the Primary Key
The previous examples used the file row number as the key to identify each row.
This key is known as the primary key.
Instead, you can specify another data column as the primary key.
For example, consider the following data, where the first column id is meant to be the primary key:
id,cocktail
11,"martini"
22,"sazerac"
33,"cosmopolitan"
44,"bellini"
file: simple-import-2cols-with-key.csv
To use the id column as the key, you can proceed as follows:
First, read in the data in a similar way to the examples so far.
Then, convert the csv relation to another relation, cocktails, which replaces the FilePos row key with the values in the id column:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/simple-import-2cols-with-key.csv"
def schema = {
(:id, "int");
(:cocktail, "string");
}
end
def csv = load_csv[config]
def cocktails(:cocktail, id, cocktail) =
csv(:id, pos, id) and csv(:cocktail, pos, cocktail)
from pos
def output = ::std::display::table[cocktails]
Please note that the key used should be unique to each row. It is good practice to enforce this uniqueness with an integrity constraint.
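For example, here is a minimal sketch of such a constraint, using the cocktails relation defined above and the stdlib function utility to require that each id maps to at most one cocktail name:

```rel
// Violated if some id is associated with more than one cocktail name.
ic cocktail_key_is_unique = {
    function(cocktails:cocktail)
}
```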
In general, the loaded CSV can be transformed in multiple ways before it is used by the rest of the model or saved as a base relation. This corresponds to an ELT (extract-load-transform) approach and can be more flexible than the traditional ETL (extract-transform-load), which transforms before the load.
Loading Multiple Files in the Same Relation
Certain cases call for loading data from several different files into the same relation. For example, you could have two files that correspond to sales over the period of two different months, or a dataset that was simply split into several files (as discussed in more detail in the following section). Here are two such example files showing cocktails and units sold in the months of March and April:
cocktail,units
martini,12
sazerac,23
file: cocktail-units-march.csv
cocktail,units
martini,12
bellini,32
file: cocktail-units-april.csv
To load the data in this case, you need to use load_csv twice, once for each file.
To distinguish the data by file, you can use the month name as an additional key in both the import configuration module config and the imported data csv:
// read query
def my_schema = {
(:cocktail, "string");
(:units, "int");
}
module config["march"]
def path = "azure://raidocs.blob.core.windows.net/csv-import/cocktail-units-march.csv"
def schema = my_schema
end
module config["april"]
def path = "azure://raidocs.blob.core.windows.net/csv-import/cocktail-units-april.csv"
def schema = my_schema
end
def csv[month] = load_csv[config[month]]
def output = ::std::display::table[csv]
Examining the output, observe that there is an additional string, either "march" or "april", identifying each dataset.
Multipart File Loads
In cases where a large data set is split up into different parts, you can use a file ID to distinguish them, and include it as part of the key, along with the file position.
The following example loads 5 files, country_part1.tsv.gz through country_part5.tsv.gz, into a single csv relation.
The config relation is parameterized by the index i, which is also included in the csv relation:
// read query
def pattern = "azure://raidocs.blob.core.windows.net/csv-import/countries/country_partNNN.tsv.gz"
def partid = range[1,5,1]
def files = string_replace[pattern, "NNN", string[x]] for x in partid
module config[i in partid]
def path = files[i]
def syntax:delim = '\t'
def syntax:header_row = -1
def syntax:header = (1, :ccode) ; (2, :cname)
def schema = (:ccode, "string") ; (:cname, "string")
end
def csv(col, i, pos, v) = load_csv[config[i]](col, pos, v)
def output:num_codes = count[code: csv:ccode(i, pos, code) from i, pos]
def output:num_pos = count[pos : csv:ccode(i, pos, code) from i, code]
def output:names(n) = csv:cname(_, _, n)
and (like_match("Ar\%", n) or like_match("Z\%", n) )
Note that while this code loads data for 250 countries, the number of file positions is smaller, since some rows start at the same file position in more than one input file.
Including i in the csv relation makes sure that those rows are distinguished from each other.
Import Errors
Sometimes, there are import errors that are not anticipated or that cannot be handled. In either case, load_csv supports error handling to help you understand what may have gone wrong.
When the CSV input file cannot be reliably read (e.g., it is formatted incorrectly), Rel attempts to parse as many rows as possible.
Any rows that it was not able to reliably parse are stored in the :load_errors sub-relation.
This sub-relation can be accessed in the same way as the rest of the sub-relations in the imported relation (e.g., csv:load_errors in the previous examples).
The :load_errors relation has the format (:load_errors, file_position, error_column, raw_line).
Note that it has arity 4, instead of the usual arity 3 of successfully imported column relations.
Examples are shown below.
Formatting Errors
Here is an example file that contains two formatting errors, namely, an extra column in each of the first two data lines:
cocktail,price
"sazerac",30,"extra"
"martini",24,"another_extra"
"cosmopolitan",20
file: import-errors-extracol.csv
In this example file, the first line is a header indicating that there should be two columns in the file. The second and third lines, however, have one extra field, which is an error. Finally, the last line is correct.
You can now attempt to import these data. Consider that the first column should be a string, while the second column should be an integer, and use the following code:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-extracol.csv"
def schema = {
(:cocktail, "string");
(:price, "int");
}
end
def csv = load_csv[config]
def output = csv
This output contains two load_errors entries, and displays the erroneous lines as the last field.
The third element of the :load_errors relation indicates the 1-based index of the field where Rel believes the error to be.
In this case, both errors are in the second field, since the schema you defined specifies two columns in the CSV file, but the two erroneous lines contain an extra third column.
Ensuring No Import Errors
Note that one easy way to ensure there are no loading errors is through the use of an integrity constraint.
Such a constraint could simply check that the :load_errors sub-relation within the csv relation is empty.
Here is the same example as before, with the addition of an integrity constraint, no_load_errors, which is violated when attempting to load problematic data:
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-extracol.csv"
def schema = {
(:cocktail, "string");
(:price, "int");
}
end
def csv = load_csv[config]
ic no_load_errors = {
empty(csv:load_errors)
}
Alternatively, the following integrity constraint not only reports when load errors have occurred, but also reports the offending lines in the CSV file:
ic no_load_errors(x...) = {
not csv:load_errors(_, x...)
}
You can add similar constraints to check for various kinds of data validation aspects. See Integrity Constraints for more details.
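For instance, here is a sketch of a simple data validation constraint in the same style as the ones above, hypothetically requiring every imported price to be positive:

```rel
// Violated if any imported price is zero or negative.
ic prices_are_positive(p) = {
    csv:price(_, p) implies p > 0
}
```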
Error in the Data Type
In the following example, there is no real error in the data, but the configuration specifies incorrect data types for each column: int for the cocktail name and string for the price. Here is the data:
cocktail,price
"sazerac",30
"martini",24
"cosmopolitan",20
file: import-errors-wrongtype.csv
Try to load the data as follows (note the incorrect data type definitions):
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-wrongtype.csv"
def schema = {
(:cocktail, "int");
(:price, "string");
}
end
def csv = load_csv[config]
def output = csv
The system reports three errors in the data.
All three errors are in the first column of the CSV file, which indicates that the cocktail names could not be converted to int.
Empty Lines
Rel omits empty lines in CSV files and proceeds to parse the subsequent lines containing data. For example, consider this CSV file that contains several blank lines:
cocktail,price
"sazerac",30
"martini",24
"cosmopolitan",20
file: import-errors-emptylines.csv
You can load these data using the following code:
// read query
module config
def path = "azure://raidocs.blob.core.windows.net/csv-import/import-errors-emptylines.csv"
def schema = {
(:cocktail, "string");
(:price, "int");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
The result shows that the empty lines were not considered errors, and the entire file was successfully loaded into the relation csv.
Missing Values
In datasets, some rows may have one or more missing values.
Here’s an example that loads data with the price value missing from the third and fifth rows:
// read query
def mydata = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,,"2020-04-04"
"""
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64,2)");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
The fields with the missing values were skipped and not imported; note, however, that there are no :load_errors in the result.
The next example loads the same dataset as before, but represents missing values with the string NA:
// read query
def mydata = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64,2)");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
The result is the same as before, plus two load_errors entries.
This is because the configuration specified that column 3 (which is missing from two rows) is a decimal(64,2).
As the “NA” strings cannot be parsed as decimal(64,2), the system considers these rows to contain errors.
As in the previous examples, Rel made a best-effort attempt and imported whatever was possible.
The following code loads the same file, this time specifying that the third column has type string:
// read query
def mydata = """
cocktail,quantity,price,date
"martini",2,12.50,"2020-01-01"
"sazerac",4,NA,"2020-02-02"
"cosmopolitan",4,11.00,"2020-03-03"
"bellini",3,NA,"2020-04-04"
"""
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "string");
(:date, "date");
}
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
There are no errors now.
Such files can thus be loaded using "string" as the type of the column with missing values; the string values can then be cleaned, filtered, and converted to their intended type using Rel after the load.
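Alternatively, the syntax:missingstrings option listed under Syntax Options should let the parser treat "NA" as a missing value, so the price column can keep its intended decimal type. Here is a sketch, reusing the mydata string defined above:

```rel
// Sketch: treat the string "NA" as a missing value instead of a parse error.
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:quantity, "int");
(:price, "decimal(64,2)");
(:date, "date");
}
def syntax:missingstrings = "NA"
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
```

With this configuration, the NA fields would be skipped like empty fields, rather than reported as load errors.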
Error in Delimiter
In the following example, the data use a different delimiter (|) than the one specified in the syntax option (;):
// read query
def mydata = """
cocktail|price
"sazerac"|30
"martini"|24
"cosmopolitan"|20
"""
module config
def data = mydata
def schema = {
(:cocktail, "string");
(:price, "int");
}
def syntax:delim = ';'
end
def csv = load_csv[config]
def output = ::std::display::table[csv]
Tips
Reusing Configurations
If you need to perform CSV imports that share a common configuration with slight variations, you can reuse a partial configuration by extending or overriding it.
The following example uses config_template to define config1 and config2, changing the type of column b with the help of the right_override (++>) stdlib utility:
// read query
module config_template
def data = """
a,b
2,4
3,9
"""
def schema:a = "int"
def schema:b = "int"
end
def config1 = config_template ++> (:schema, :b, "float")
def config2 = config_template ++> (:schema, :b, "string")
def output:one = load_csv[config1]
def output:two = load_csv[config2]
See Also
See the CSV Export guide to learn how to export CSV data. For more details on how to interact with data, see the rest of the Data Import and Export guides.
This guide complements the Working With CSV Data concept guide, where you can find all the relevant information about how to work with CSV data in Rel, including querying, data analysis, and manipulation.