Machine Learning (Classification)

This how-to guide demonstrates how to load a dataset, build a classification model, and perform predictions using that model.

Goal

The goal of this how-to guide is to provide an introduction to Rel’s machine learning functionality. As one part of a larger series of machine learning how-to guides, this guide will focus on classification. Specifically, we will explore how to load a dataset, build a classification model, and perform predictions using that model.

Preliminaries

We recommend that you also go through the CSV Import Guide and JSON Import and Export Guide, since they contain examples and functionality useful to understand how to appropriately load different kinds of data into the system.

Dataset

For this how-to guide we will be using the Palmer Archipelago (Antarctica) penguin data. We will use a copy of the penguin dataset located in our public S3 bucket.

This is a multivariate dataset with instances of penguins together with their features. We will be using the penguins_size.csv file for our guide.

The dataset contains 344 instances of penguins from three species (classes), Adelie, Chinstrap and Gentoo. The Adelie species contains 152 instances of penguins, Chinstrap has 68, and Gentoo has 124.

For each instance within the dataset, in addition to the species, there are 6 attributes:

| Attribute | Description | Type |
| --- | --- | --- |
| island | The name of the island (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica) where the penguin was found and measured | Categorical |
| culmen_length_mm | The length of the penguin’s culmen in millimeters | Numerical |
| culmen_depth_mm | The depth of the penguin’s culmen in millimeters | Numerical |
| flipper_length_mm | The length of the penguin’s flippers in millimeters | Numerical |
| body_mass_g | The body mass of the penguin in grams | Numerical |
| sex | The sex (MALE, FEMALE) of the penguin | Categorical |

Our goal in this guide is to build a classifier to predict the species of the penguin, given its attributes.

Here is a sample showing the header and the first 5 data rows of the penguins_size.csv file that we will be working with:

species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
...

As you can see, there are certain instances of penguins where the data is not available (as denoted by NA in the example above). To address this, we will clean the loaded data in the Cleaning the Data section.

Loading the Data

We start our how-to guide on building a classifier by loading the data from the file containing the penguin data. We can load the file using load_csv as follows:

update
def config[:path] = "s3://relationalai-documentation-public/ml-classification/penguin/penguins_size.csv"

def config[:schema, :species] = "string"
def config[:schema, :island] = "string"
def config[:schema, :culmen_length_mm] = "float"
def config[:schema, :culmen_depth_mm] = "float"
def config[:schema, :flipper_length_mm] = "float"
def config[:schema, :body_mass_g] = "float"
def config[:schema, :sex] = "string"


// insert transaction
def insert[:penguins] = lined_csv[load_csv[config]]

Note that the code above specifies the path to the file, which is located in our public AWS S3 bucket. The s3:// URL indicates a path to an AWS S3 bucket.

Additionally, we are reading the header from the file and we will use the header names as our attribute names. Finally, we specified the schema of the imported file. Specifically, we indicated that the first two and last attributes (species, island, sex) are of type string, while the remaining (culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g) are float. In this guide we will learn to predict the species attribute.

Cleaning the Data

As we discussed in the previous section, there are certain instances (or lines) in the dataset that we need to clean up. One such instance was shown earlier, where all the measurement values and the sex were set to NA. As a first step, Rel has already handled these lines for us. Since it wasn’t able to parse NA as a float, these instances were stored as load_errors inside the penguins relation:

query
penguins:load_errors

Relation: output

4    3  "Adelie,Torgersen,NA,NA,NA,NA,NA"
4    4  "Adelie,Torgersen,NA,NA,NA,NA,NA"
4    5  "Adelie,Torgersen,NA,NA,NA,NA,NA"
4    6  "Adelie,Torgersen,NA,NA,NA,NA,NA"
340  3  "Gentoo,Biscoe,NA,NA,NA,NA,NA"
340  4  "Gentoo,Biscoe,NA,NA,NA,NA,NA"
340  5  "Gentoo,Biscoe,NA,NA,NA,NA,NA"
340  6  "Gentoo,Biscoe,NA,NA,NA,NA,NA"

As we can see from the file positions, two lines in the dataset had all of their numerical attributes set to NA. Each of them is reported four times, once for every float column that failed to parse.
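To make the error report above more concrete, here is a minimal sketch in plain Python (not Rel, and not a description of how load_csv is implemented). It parses the sample rows shown earlier with the same schema and records one error per float field that cannot be parsed. The helper name `parse_rows` and the `(row, column, line)` error layout are assumptions made for this illustration.

```python
import csv

# Sample taken from this guide: the header plus the first five data rows.
SAMPLE = """species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
"""

# Same schema as in the guide: the measurements are floats, the rest strings.
FLOAT_COLUMNS = {"culmen_length_mm", "culmen_depth_mm",
                 "flipper_length_mm", "body_mass_g"}


def parse_rows(text):
    """Return (values, errors); errors holds (row, column_index, raw_line)."""
    lines = text.strip().splitlines()
    header = next(csv.reader([lines[0]]))
    values, errors = {}, []
    for row_number, raw in enumerate(lines[1:], start=1):
        fields = next(csv.reader([raw]))
        for col_index, (name, field) in enumerate(zip(header, fields), start=1):
            if name in FLOAT_COLUMNS:
                try:
                    values[(name, row_number)] = float(field)
                except ValueError:
                    # One error entry per float field that fails to parse.
                    errors.append((row_number, col_index, raw))
            else:
                values[(name, row_number)] = field
    return values, errors


values, errors = parse_rows(SAMPLE)
for error in errors:
    print(error)
```

In this sketch, row 4 produces four error entries (columns 3 to 6), while its sex field is kept as the string "NA" because string columns always parse. That is why rows with an NA sex need a separate cleaning rule.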

In addition to those errors, there are also a few lines where sex is recorded as NA (8 in total) and one line where sex is recorded as a single period ("."). For the purposes of this guide, we drop all rows with any of these issues, which gives us a clean dataset:

update
def row_with_error(row) =
    penguins:sex(row, "NA") or
    penguins:sex(row, ".") or
    penguins:load_errors(row, _, _)

def delete[:penguins] = column, row, entry... : penguins(column, row, entry...) and row_with_error(row)

Preparing the Data

Once the data is loaded, we need to transform it so that it can be fed into the machine learning models.

In general, we support a variety of machine learning models. The complete list of supported models can be found in the Machine Learning Library.

Most of these models require two relations:

  • one containing the features to be used as inputs to train a model, and
  • one containing the response (or target) variable (or, class, in our case) that we want to learn to predict.

To this end, we put the feature data in the features relation and the class data (that are currently read as strings) in the responses_string relation.

install
def features = penguins[col]
    for col in {
        :island; :culmen_length_mm; :culmen_depth_mm;
        :flipper_length_mm; :body_mass_g; :sex
    }

def responses_string = penguins:species

We can easily get statistics about our features data using describe:

query
table[describe[features]]

Relation: output

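| | body_mass_g | culmen_depth_mm | culmen_length_mm | flipper_length_mm | island | sex |
| --- | --- | --- | --- | --- | --- | --- |
| count | 333 | 333 | 333 | 333 | 333 | 333 |
| max | 6300.0 | 21.5 | 59.6 | 231.0 | "Torgersen" | "MALE" |
| mean | 4207.057057057057 | 17.16486486486487 | 43.992792792792805 | 200.96696696696696 | | |
| min | 2700.0 | 13.1 | 32.1 | 172.0 | "Biscoe" | "FEMALE" |
| percentile25 | 3550.0 | 15.6 | 39.5 | 190.0 | | |
| percentile50 | 4050.0 | 17.3 | 44.5 | 197.0 | | |
| percentile75 | 4775.0 | 18.7 | 48.6 | 213.0 | | |
| std | 805.2158019428966 | 1.9692354633199 | 5.468668342647562 | 14.015765288287882 | | |
| mode | | | | | "Biscoe" | "MALE" |
| mode_freq | | | | | 163 | 168 |
| unique | | | | | 3 | 2 |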

and, of course, we can do the same for our responses_string data:

query
describe_full[responses_string]

Relation: output

:count      333
:max        "Gentoo"
:min        "Adelie"
:mode       "Adelie"
:mode_freq  146
:unique     3

Here, we used describe_full because we have only one column in the responses_string relation.

Converting Class Names to Integers

We plan to use an mlpack classifier, which requires the response classes to be represented as integers rather than strings.

To this end, we will first identify all the unique classes. We can get them using last:

install
def classes = last[responses_string]

Next, we add numbers as an id for each class. We can do this using sort, which sorts the classes and we can use the ordering index as the class id:

install
def id_class = sort[classes]

query
id_class

Relation: output

1  "Adelie"
2  "Chinstrap"
3  "Gentoo"

In order to join back with the relation responses_string and get the ids, we need to swap the first and second columns. We can do this using transpose:

install
def class_id = transpose[id_class]

Please note that transpose simply swaps the first and second columns, and is not to be confused with the typical matrix transposition. After we swap the columns, we can join with the responses_string relation:

install
def responses = responses_string.class_id

Of course, we could have done all this in one step as follows:

def responses = responses_string.(transpose[sort[last[responses_string]]])
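As an illustration of the same idea outside Rel, the following plain-Python sketch walks through the four steps (unique classes, sorted numbering, column swap, join) on a tiny hypothetical input. The row numbers and class values are made up for the example. The comments name the Rel expression that each step mirrors, without claiming that the Rel library works this way internally.

```python
# Hypothetical responses keyed by row number (not the real dataset).
responses_string = {1: "Adelie", 2: "Gentoo", 3: "Chinstrap", 4: "Adelie"}

# Unique class names, mirroring last[responses_string].
classes = set(responses_string.values())

# 1-based position in sorted order, mirroring sort[classes].
id_class = dict(enumerate(sorted(classes), start=1))

# Swap the two columns, mirroring transpose[id_class].
class_id = {name: idx for idx, name in id_class.items()}

# Join back on the class name, mirroring responses_string.class_id.
responses = {row: class_id[name] for row, name in responses_string.items()}

print(id_class)   # {1: 'Adelie', 2: 'Chinstrap', 3: 'Gentoo'}
print(responses)  # {1: 1, 2: 3, 3: 2, 4: 1}
```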

Creating Training and Test Datasets

In classification (as well as other machine learning approaches), we use a “training” dataset to learn a classification model and a “test” dataset to determine the accuracy of our model. In certain cases, we may also use a validation dataset for parameter tuning, but we will consider only training and test for the purposes of this how-to guide.

Since the penguins dataset is not already split into training and test sets, we will have to create these two datasets.

In the following, we split our data into training and test sets with an 80/20 ratio. We specify the splitting ratio and the seed in split_params. The splitting is done by mlpack_preprocess_split, which splits the keys into two sets. Afterwards, we join them with the features and responses to generate the corresponding training and test datasets:

install
def split_params = {("test_ratio", "0.2"); ("seed", "42")}

def data_key(:keys, k) = features(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_params]

def features_train(f, k, v) = features(f, k, v) and data_key_split(1, k)
def features_test(f, k, v) = features(f, k, v) and data_key_split(2, k)

def responses_train(k, v) = responses(k, v) and data_key_split(1, k)
def responses_test(k, v) = responses(k, v) and data_key_split(2, k)

The relation split_params specifies the exact splitting ratio between training and test sets. Note that the parameter name as well as the value need to be encoded as strings.

At this point, we can also add various checks to ensure that no instances from the original dataset were lost when we split it into training and test sets. For example, we can check that the number of instances in training and test add up:

install
ic all_data() {
    count[features_train] + count[features_test] = count[features]
}

Or, we can more rigorously ensure that we have actually performed a split using all the available data:

install
ic all_features() {
    equal(features, union[features_train, features_test])
}
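The split and the two checks above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `split_keys` is a hypothetical helper that shuffles the keys with a seeded generator, which is not necessarily the algorithm that mlpack_preprocess_split uses, and the feature values are made up.

```python
import random


def split_keys(keys, test_ratio, seed):
    """Shuffle the keys deterministically and return (train_keys, test_keys)."""
    shuffled = sorted(keys)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return set(shuffled[n_test:]), set(shuffled[:n_test])


# Hypothetical features keyed by (feature_name, row), like features(f, k, v).
features = {("body_mass_g", k): 3000.0 + 10 * k for k in range(1, 11)}
features.update({("sex", k): "MALE" if k % 2 else "FEMALE" for k in range(1, 11)})

keys = {k for (_, k) in features}
train_keys, test_keys = split_keys(keys, test_ratio=0.2, seed=42)

# Join the key sets back with the features.
features_train = {fk: v for fk, v in features.items() if fk[1] in train_keys}
features_test = {fk: v for fk, v in features.items() if fk[1] in test_keys}

# Same intent as the two integrity constraints: the counts add up,
# and the union of the two parts is exactly the original relation.
assert len(features_train) + len(features_test) == len(features)
assert {**features_train, **features_test} == features

print(len(train_keys), len(test_keys))  # 8 2
```

Splitting the keys first and joining afterwards guarantees that all features of a given row end up on the same side of the split.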

Building a Classifier

In this guide, we will be using mlpack to create a decision tree classifier. The decision tree classifier of mlpack (as well as most of the other classifiers) can accept a set of optional parameters to tune the specific algorithm. The parameters for each classifier (aka hyper-parameters) are documented in the Machine Learning Library reference.

We set the hyper-parameters through a relation (we call it hyper_params here), as follows:

install
def hyper_params = {("minimum_leaf_size", "10"); ("minimum_gain_split", "1e-07")}

Please note that each classifier has its own parameters, which you can find in the Machine Learning Library reference. Additionally, the parameters currently need to be passed as strings, as in the example above. We can also pass no parameters to the classifier. In our example, we specified that we want the minimum number of instances in a leaf to be 10 and we set the minimum gain for node splitting to be 1e-07.

At this point, we are ready to build our classifier. We will use mlpack_decision_tree and specify the features for learning (i.e., the features_train relation), the classes to learn to predict (i.e., the responses_train relation), and the parameters:

install
def classifier = mlpack_decision_tree[features_train, responses_train, hyper_params]

We now have a trained classifier: the relation classifier represents the model we have learned.

Performing Predictions

Our trained model classifier is now ready for making predictions. To make predictions, we use mlpack_decision_tree_predict, to which we have to provide:

  1. the trained ML model,
  2. a relation with features similar to the one that was used for training, and
  3. a number that indicates the number of keys used in the feature relation.

The information about the number of keys is necessary because it defines the arity of the relation with the features used to perform the predictions. In our case, we have only one key: the CSV file position, which we carried over from the data loading step.

We can predict the penguin species using the training dataset:

install
def predictions_train = mlpack_decision_tree_predict[classifier, features_train, 1]

and we can, of course, also predict the penguin species of the unseen test dataset:

install
def predictions_test = mlpack_decision_tree_predict[classifier, features_test, 1]

Let’s look at some predictions for the test dataset:

query
top[5, predictions_test]

Relation: output

1  6   1
2  20  2
3  21  1
4  34  1
5  39  1

Evaluating Our Model

We can evaluate machine learning models using a variety of metrics. One popular metric is accuracy, which is defined as the number of correct predictions divided by the total number of predictions.

We can compute the accuracy of the classifier model on the training dataset as follows:

install
def train_accuracy =
    count[pos : predictions_train[pos] = responses_train[pos]] /
    count[responses_train]

query
train_accuracy

Relation: output

0.947565543071161

Of course, what we really care about is the performance of our model on the test dataset:

install
def test_accuracy =
    count[pos : predictions_test[pos] = responses_test[pos]] /
    count[responses_test]

query
test_accuracy

Relation: output

0.9545454545454546
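For readers who want to see the accuracy definition spelled out independently of Rel, here is a minimal plain-Python sketch. The function name `accuracy` and the five example rows are assumptions for illustration; the values are not taken from the real test set.

```python
def accuracy(predictions, responses):
    """Fraction of keys whose predicted class equals the actual class."""
    correct = sum(1 for key, actual in responses.items()
                  if predictions.get(key) == actual)
    return correct / len(responses)


# Illustrative actual and predicted classes, keyed by row number.
responses_example = {6: 1, 20: 1, 21: 1, 34: 1, 39: 1}
predictions_example = {6: 1, 20: 2, 21: 1, 34: 1, 39: 1}

print(accuracy(predictions_example, responses_example))  # 0.8
```

Four of the five rows agree, so the sketch prints 0.8. The Rel definitions above follow the same pattern: count the positions where prediction and response match, then divide by the number of responses.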

We can also compute precision and recall (aka sensitivity) metrics for each class

install
def test_precision[c] =
    count[pos : predictions_test(pos, c) and responses_test(pos, c)] /
    count[pos : predictions_test(pos, c)]

def test_recall[c] =
    count[pos : predictions_test(pos, c) and responses_test(pos, c)] /
    count[pos : responses_test(pos, c)]

and query them.

query
def output:precision = test_precision
def output:recall = test_recall

Relation: output

:precision  1  0.9642857142857143
:precision  2  0.8181818181818182
:precision  3  1.0
:recall     1  0.9642857142857143
:recall     2  0.9
:recall     3  0.9642857142857143

With precision and recall metrics at hand, we can also compute the F1 score for each class

install
def test_f1[c] = 2 * test_precision[c] * test_recall[c] / (test_precision[c] + test_recall[c])

and query them.

query
test_f1

Relation: output

1  0.9642857142857143
2  0.8571428571428572
3  0.9818181818181818
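As a cross-check of the per-class metrics, the plain-Python sketch below recomputes precision, recall, and F1 from the test-set counts that this guide reports in its confusion matrix output further down. Only the counts come from the guide; the dictionary layout and the function names are assumptions for illustration.

```python
# (predicted, actual) -> count on the test set, copied from this guide's
# confusion matrix output; pairs that never occur are simply absent.
confusion = {(1, 1): 27, (1, 2): 1, (2, 1): 1,
             (2, 2): 9, (2, 3): 1, (3, 3): 27}
class_ids = [1, 2, 3]


def precision(c):
    """Correct predictions of class c over all predictions of class c."""
    predicted_c = sum(n for (p, _), n in confusion.items() if p == c)
    return confusion.get((c, c), 0) / predicted_c


def recall(c):
    """Correct predictions of class c over all actual instances of class c."""
    actual_c = sum(n for (_, a), n in confusion.items() if a == c)
    return confusion.get((c, c), 0) / actual_c


def f1(c):
    """Harmonic mean of precision and recall for class c."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r)


for c in class_ids:
    print(c, round(precision(c), 4), round(recall(c), 4), round(f1(c), 4))
```

Rounded to four decimals, the sketch gives 0.9643, 0.8182, and 1.0 for precision, 0.9643, 0.9, and 0.9643 for recall, and 0.9643, 0.8571, and 0.9818 for F1, which agrees with the outputs shown above.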

Finally, we can compute the full confusion matrix (where actual is the actual class, or response, and predicted is the predicted class):

install
def confusion_matrix[predicted, actual] = count[x :
    responses_test(x, actual) and predictions_test(x, predicted)
]

When we query for it, we get:

query
confusion_matrix

Relation: output

1  1  27
1  2  1
2  1  1
2  2  9
2  3  1
3  3  27

Note that count does not return 0 for an empty relation, which means that if no data record of class actual was predicted to be of class predicted then this pair does not appear in confusion_matrix. This point relates strongly to the fundamental principle that, in Rel, missing data (or NULL in SQL) is not explicitly stored or represented.

To assign a zero count to these missing values, we simply need to explicitly define that for any missing predicted-actual class pair, (predicted, actual), we want to assign a count of 0. This is done below with the left-override operator <++,

query
table[
    confusion_matrix[class_column.class_id, class_row.class_id] <++ 0
    for class_column in classes,
        class_row in classes
]

Relation: output

| | Adelie | Chinstrap | Gentoo |
| --- | --- | --- | --- |
| Adelie | 27 | 1 | 0 |
| Chinstrap | 1 | 9 | 0 |
| Gentoo | 0 | 1 | 27 |

where we also convert back the integer class IDs to their original class names and state that we want the relation to be displayed as a wide table.
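The zero-filling step can also be sketched in plain Python, where a dictionary lookup with a default value plays the role that <++ 0 plays in the Rel query. The counts and class names are the ones reported in this guide; the variable names and the nested-dictionary layout (rows are actual classes, columns are predicted classes) are assumptions for illustration.

```python
# (predicted, actual) -> count, copied from the confusion_matrix output.
# Pairs that never occurred are absent, just as in the Rel relation.
confusion = {(1, 1): 27, (1, 2): 1, (2, 1): 1,
             (2, 2): 9, (2, 3): 1, (3, 3): 27}

# Class IDs and names as produced earlier in the guide by sort[classes].
names = {1: "Adelie", 2: "Chinstrap", 3: "Gentoo"}

# Build a dense table: one row per actual class, one column per predicted
# class. confusion.get(..., 0) supplies 0 wherever a pair is missing.
dense = {
    names[actual]: {
        names[predicted]: confusion.get((predicted, actual), 0)
        for predicted in names
    }
    for actual in names
}

for row_name, row in dense.items():
    print(row_name, [row[column] for column in names.values()])
```

The sketch prints the rows [27, 1, 0], [1, 9, 0], and [0, 1, 27], matching the wide table above.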

Discussion

We demonstrated the use of a decision tree classifier on the penguin dataset. More specifically, we used mlpack_decision_tree, i.e., a decision tree classifier from mlpack. We, of course, support other classifiers as well, such as mlpack_random_forest.

In addition to mlpack, we also support other machine learning libraries such as glm or xgboost (with more coming).

It is important to note here that all of our machine learning models are specifically designed to have the same API. In this way, we can easily swap machine learning models (of similar type, i.e., classification models). In our example in this guide we can simply switch mlpack_decision_tree with mlpack_random_forest, change the hyper_params to the right parameters for mlpack_random_forest (or just leave it empty to use the defaults), and we now have a random forest classifier.

In addition to the machine learning models, the Machine Learning Library has useful functionality for other tasks as well. For example, we can perform k-nearest-neighbor search on a relation through mlpack_knn or perform dimensionality reduction through kernel principal component analysis (KPCA) in a given dataset through mlpack_kernel_pca.

For a complete list of machine learning models and related functionality please see the Machine Learning Library reference.