
# Machine Learning: Classification

This how-to guide demonstrates how to load a dataset, build a classification model, and perform predictions using that model.

## Goal

This how-to guide provides an introduction to Rel’s machine learning functionality. As one part of a larger series of machine learning how-to guides, this guide focuses on classification. Specifically, this guide explores how to load a dataset, build a classification model, and perform predictions using that model.

## Preliminaries

It’s helpful to read through the CSV Import Guide and JSON Import and Export Guide. These guides contain examples that will show you how to load different kinds of data into the system.

## Dataset

This how-to guide uses the Palmer Archipelago (Antarctica) penguin data. You will use a copy of the penguin dataset located in RelationalAI’s public S3 bucket.

This is a multivariate dataset with instances of penguins together with their features. You will use the penguins_size.csv file.

The dataset contains 344 instances of penguins from three species (classes), Adelie, Chinstrap, and Gentoo. The Adelie species contains 152 instances of penguins, Chinstrap has 68, and Gentoo has 124.

For each instance within the dataset, in addition to the species, there are six features:

| Feature | Description | Type |
| --- | --- | --- |
| island | The name of the island (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica) where the penguin was found and measured. | Categorical |
| culmen_length_mm | The length of the penguin’s culmen in millimeters. | Numerical |
| culmen_depth_mm | The depth of the penguin’s culmen in millimeters. | Numerical |
| flipper_length_mm | The length of the penguin’s flippers in millimeters. | Numerical |
| body_mass_g | The body mass of the penguin in grams. | Numerical |
| sex | The sex (MALE, FEMALE) of the penguin. | Categorical |

Your goal is to build a classifier to predict the species of the penguin, given its features.

Here is a sample of the first five lines of the penguins_size.csv file:

species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
...


As you can see, there are certain instances of penguins where the data are not available, denoted by NA in the example above. To address this, you will perform some data cleaning over the loaded data, discussed further down.

You will begin building a classifier by loading the data from the file containing the penguin data. You can load the file using load_csv as follows:

// update

def config[:path] = "s3://relationalai-documentation-public/ml-classification/penguin/penguins_size.csv"

def config[:schema, :species] = "string"
def config[:schema, :island] = "string"
def config[:schema, :culmen_length_mm] = "float"
def config[:schema, :culmen_depth_mm] = "float"
def config[:schema, :flipper_length_mm] = "float"
def config[:schema, :body_mass_g] = "float"
def config[:schema, :sex] = "string"

// insert transaction
def insert[:penguin_raw] = lined_csv[load_csv[config]]

Note that the code above specifies the path to the file, which is located in RelationalAI’s public AWS S3 bucket. The s3:// prefix of the URL indicates a path in an AWS S3 bucket.

The code reads the header from the file and will use the header names as feature names. You have specified the schema of the imported file. Specifically, the first two and the last feature (species, island, sex) are of type string, while the remaining (culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g) are float. In this guide, you will learn to predict the species feature.

## Cleaning the Data

As discussed in the previous section, there are certain instances, or lines, in the dataset that you need to clean up. One such instance was shown earlier, where all the values were set to NA. As a first step, Rel has already cleaned up these values for you. Since it wasn’t able to parse NA as float, these instances were stored as load_errors inside the penguin_raw relation:

// query

penguin_raw:load_errors

As you can see from the file positions, there were two such lines with all of their features set to ‘NA’ in the dataset.
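To make the load-error behavior concrete, here is a minimal sketch in plain Python (not Rel, and not a description of how load_csv is implemented). It shows schema-driven parsing in which a row with an unparseable float value is set aside as an error. The sample rows, the `parse_rows` helper, and the error format are hypothetical, and setting aside the whole row is a simplifying assumption.

```python
import csv
import io

# Hypothetical sample rows, made up for illustration (not taken from the dataset).
SAMPLE = """species,culmen_length_mm,sex
Adelie,39.1,MALE
Adelie,NA,NA
Gentoo,46.5,FEMALE
"""

# Column name -> parser, mirroring the idea of a "string" / "float" schema.
SCHEMA = {"species": str, "culmen_length_mm": float, "sex": str}


def parse_rows(text, schema):
    """Return (rows, errors): cleanly parsed rows, and (line_number, raw_row) for the rest."""
    rows, errors = [], []
    reader = csv.DictReader(io.StringIO(text))
    for line_number, raw in enumerate(reader, start=2):  # line 1 is the header
        try:
            rows.append({name: parse(raw[name]) for name, parse in schema.items()})
        except ValueError:
            # float("NA") raises ValueError, so this row is recorded as an error.
            errors.append((line_number, raw))
    return rows, errors


rows, errors = parse_rows(SAMPLE, SCHEMA)
```

In this sketch, `errors` plays a role loosely comparable to load_errors: it records where the problematic lines are so that they can be excluded later.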

In addition to those errors, there are also a few lines where sex is defined as NA (eight in total), and one line where sex is defined as a period (.). For the purpose of this guide, you will drop all rows with an issue. You can get a clean dataset as follows:

// install

def row_with_error(row) =
    penguin_raw:sex(row, "NA") or
    penguin_raw:sex(row, ".") or
    penguin_raw:load_errors(row, _, _)

// update

def insert[:penguin] = column, row, entry... :
    penguin_raw(column, row, entry...) and not
    row_with_error(row)
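The filtering logic above amounts to a set difference over row keys. As a rough illustration in plain Python (not Rel), the sketch below collects the keys of problematic rows and keeps everything else. The row keys, the values, and the `rows_with_error` helper are hypothetical.

```python
# Hypothetical values of the sex column keyed by file position; made up for illustration.
raw_sex = {2: "MALE", 3: "FEMALE", 5: "NA", 6: ".", 7: "FEMALE"}
load_error_rows = {4}  # hypothetical rows that failed to parse


def rows_with_error(sex_by_row, error_rows):
    """Keys of rows whose sex is "NA" or ".", plus rows that failed to load."""
    bad_sex = {row for row, sex in sex_by_row.items() if sex in ("NA", ".")}
    return bad_sex | set(error_rows)


bad = rows_with_error(raw_sex, load_error_rows)

# Keep only the rows that have no issue.
clean_sex = {row: sex for row, sex in raw_sex.items() if row not in bad}
```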

The final penguin dataset looks as follows:

// query

table[penguin]

## Analyzing the Data

You can easily visualize the data you just loaded in different ways. For example, take a look at the distribution of male and female penguins by species:

// query

def output = vegalite:plot[
    vegalite:bar[
        :species,
        { :aggregate, "count" },
        { :data, penguin; :color, :sex; }
    ]
]

## Preparing the Data

Once you have the data loaded, you need to transform the data in order to feed them into the machine learning models.

In general, you can use a variety of machine learning models. For the complete list of supported models, see the Machine Learning Library.

Most of these models require two relations:

• One containing the features to be used as inputs to train a model.
• One containing the response (or target) variable (or class in this case) that you want to learn to predict.

To this end, you can put the feature data in the features relation and the class data, which are currently read as strings, in the response_string relation. Note that in the current implementation of the Machine Learning Library, the relation from which you extract the features (i.e., penguin) needs to be a base relation. This was done earlier using insert when you defined the penguin relation.

// install

def features = penguin[col]
    for col in {
        :island; :culmen_length_mm; :culmen_depth_mm;
        :flipper_length_mm; :body_mass_g; :sex
    }

def response_string = penguin:species

You can easily get statistics about your features data using describe:

// query

table[describe[features]]

You can also do the same for your response_string data:

// query

table[(:response, describe_full[response_string])]

Here, describe_full is used because there is only one column in the response_string relation. Contrary to describe, describe_full provides statistics for the overall set of data rather than per feature.

### Converting Class Names to Integers

You will use an mlpack classifier, so you need to represent the response classes specifically as integers. You cannot use strings or floats to represent the classes.

To this end, you will first identify all the unique classes. You can get them using last:

// install

def classes = last[response_string]

Next, you add numbers as an ID for each class. You can do this using sort, which sorts the classes, and you can use the ordering index as the class ID:

// install

def id_class = sort[classes]

// query

id_class

In order to join with the relation response_string and get the IDs, you need to swap the first and second columns. You can do this using transpose:

// install

def class_id = transpose[id_class]

Note that transpose simply swaps the first and second columns and is not to be confused with the typical matrix transposition. After you swap the columns, you can join with the response_string relation:

// install

def response = response_string.class_id

You can also do all this in one step as follows:

def response = response_string.(transpose[sort[last[response_string]]])
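The same name-to-ID mapping can be sketched step by step in plain Python (not Rel). The row keys and values below are hypothetical, and numbering the sorted classes from 1 is an assumption meant to mirror the ordering index that sort assigns.

```python
# Hypothetical response values keyed by row; made up for illustration.
response_string = {2: "Adelie", 3: "Gentoo", 4: "Adelie", 5: "Chinstrap"}

# Unique classes (the role of last[response_string] in the guide).
classes = set(response_string.values())

# Sort the classes and number them from 1 (the role of sort).
id_class = dict(enumerate(sorted(classes), start=1))

# Swap keys and values (the role of transpose).
class_id = {name: idx for idx, name in id_class.items()}

# Join: replace each class name with its integer ID.
response = {row: class_id[name] for row, name in response_string.items()}
```

Because the IDs come from the sorted order of the class names, the mapping is deterministic for a given set of classes.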

### Creating Training and Test Datasets

In classification, as in other machine learning approaches, you use a training dataset to learn a classification model and a test dataset to estimate how accurate the model is on unseen data. In certain cases, you may also use a validation dataset for parameter tuning, but this how-to guide considers only training and test datasets.

Because the penguin dataset is not already split into training and test sets, you will have to create these two datasets.

The following example splits the data into training and test sets with a ratio of 80/20. You can specify the splitting ratio and the seed in split_params. The splitting is done by mlpack_preprocess_split, which splits the keys into two sets. Afterwards, you can join these sets with the features and response relations to generate the corresponding training and test datasets:

// install

def split_params = {("test_ratio", "0.2"); ("seed", "42")}

def data_key(:keys, k) = features(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_params]

def feature_train(f, k, v) = features(f, k, v) and data_key_split(1, k)
def feature_test(f, k, v) = features(f, k, v) and data_key_split(2, k)

def response_train(k, v) = response(k, v) and data_key_split(1, k)
def response_test(k, v) = response(k, v) and data_key_split(2, k)

The relation split_params specifies the exact splitting ratio between training and test sets. Note that both the parameter name and the value need to be encoded as strings.

At this point, you can also add checks to ensure that the split into training and test sets covers every instance of the original dataset. For example, you can check that the numbers of instances in the training and test sets add up:

// install

ic all_data() {
    count[feature_train] + count[feature_test] = count[features]
}

Or, you can more rigorously ensure that you have actually performed a split using all the available data:

// install

ic all_features() {
    equal(features, union[feature_train, feature_test])
}
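For intuition, a seeded split of keys and the corresponding coverage checks can be sketched in plain Python (not Rel). The `split_keys` helper and the key range are hypothetical, and the sketch does not reproduce the partition that mlpack_preprocess_split produces for the same seed.

```python
import random


def split_keys(keys, test_ratio, seed):
    """Shuffle keys with a fixed seed and split them into (train, test) sets."""
    ordered = sorted(keys)  # fixed starting order, so the result is reproducible
    random.Random(seed).shuffle(ordered)
    n_test = round(len(ordered) * test_ratio)
    return set(ordered[n_test:]), set(ordered[:n_test])


keys = set(range(1, 101))  # hypothetical row keys
train_keys, test_keys = split_keys(keys, test_ratio=0.2, seed=42)

# Checks analogous to the integrity constraints above.
assert len(train_keys) + len(test_keys) == len(keys)  # the counts add up
assert train_keys | test_keys == keys                 # every key is used
assert not (train_keys & test_keys)                   # no key is in both sets
```

Taken together, the count check and the union check imply that the two sets do not overlap. The third assertion states this explicitly.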

## Building a Classifier

This guide uses mlpack to create a decision tree classifier. The decision tree classifier of mlpack, as well as most of the other classifiers, can accept a set of optional parameters to tune the specific algorithm. The parameters for each classifier, otherwise known as hyper-parameters, are documented in the Machine Learning Library reference.

You can set the hyper-parameters through a relation — called hyper_param here — as follows:

// install

def hyper_param = {
    ("minimum_leaf_size", "10");
    ("minimum_gain_split", "1e-07")
}

Note that each classifier has its own parameters that you can find through the Machine Learning Library reference. Additionally, it is important to note that the parameters currently need to be passed as strings, similar to the example above. You can also pass no parameters to the classifier. This example specified the minimum number of instances in a leaf as 10 and set the minimum gain for node splitting to 1e-07.

At this point, you are ready to build your classifier. You will use mlpack_decision_tree and specify the features for learning (i.e., the feature_train relation), the classes to learn to predict (i.e., the response_train relation), and the parameters:

// install

def classifier = mlpack_decision_tree[
    feature_train,
    response_train,
    hyper_param
]

Now you have a trained classifier with the relation classifier, which represents the model you have learned.

## Performing Predictions

The trained model classifier is now ready to make predictions. To make predictions, you have to use mlpack_decision_tree_predict, where you need to provide:

1. The trained ML model.
2. A relation with features similar to the one used for training.
3. A number that indicates the number of keys used in the feature relation.

The information about the number of keys is necessary because it defines the arity of the relation with the features used to perform the predictions. In this case, you have only one key: the CSV file position, carried over from the data loading step.

You can predict the penguin species using the training dataset:

// install

def prediction_train = mlpack_decision_tree_predict[
    classifier,
    feature_train,
    1
]

You can also predict the penguin species of the unseen test dataset:

// install

def prediction_test = mlpack_decision_tree_predict[
    classifier,
    feature_test,
    1
]

Here are some predictions for the test dataset:

// query

top[5, prediction_test]

## Evaluating the Model

You can evaluate machine learning models using a variety of metrics. A common one is accuracy, defined as the number of correct predictions divided by the total number of predictions.

You can compute the accuracy of the classifier model on the training dataset as follows:

// install

def train_accuracy =
    count[pos : prediction_train[pos] = response_train[pos]] /
    count[response_train]

// query

train_accuracy

What matters here is the performance of your model on the test dataset:

// install

def test_accuracy =
    count[pos : prediction_test[pos] = response_test[pos]] /
    count[response_test]

// query

test_accuracy
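As a small worked example of the accuracy definition, here is a plain Python sketch (not Rel). The `accuracy` helper and the class IDs are hypothetical and made up for illustration.

```python
def accuracy(prediction, response):
    """Fraction of keys in `response` whose predicted class matches the actual class."""
    correct = sum(
        1 for key, actual in response.items() if prediction.get(key) == actual
    )
    return correct / len(response)


# Hypothetical class IDs keyed by row; made up for illustration.
response_test = {1: 1, 2: 2, 3: 3, 4: 1, 5: 2}
prediction_test = {1: 1, 2: 2, 3: 1, 4: 1, 5: 3}

# Rows 1, 2, and 4 are predicted correctly, so the accuracy is 3 / 5.
test_accuracy = accuracy(prediction_test, response_test)
```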

You can also compute precision and recall (otherwise known as sensitivity) metrics for each class:

// install

def score_precision[c] =
    count[pos : prediction_test(pos, c) and response_test(pos, c)] /
    count[pos : prediction_test(pos, c)]

def score_recall[c] =
    count[pos : prediction_test(pos, c) and response_test(pos, c)] /
    count[pos : response_test(pos, c)]

You can also query them:

// query

score_precision

// query

score_recall

With precision and recall metrics at hand, you can also compute the F1 score for each class:

// install

def score_f1[c] =
    2 * score_precision[c] * score_recall[c] /
    (score_precision[c] + score_recall[c])

You can then query them:

// query

score_f1
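The per-class metrics can be traced by hand with a plain Python sketch (not Rel). The `per_class_scores` helper and the class IDs are hypothetical. Leaving out classes with no correct predictions is an assumption meant to mirror the fact that an empty count yields no value rather than 0.

```python
def per_class_scores(prediction, response):
    """Return {class: (precision, recall, f1)} for classes where all three are defined."""
    scores = {}
    for c in set(response.values()) | set(prediction.values()):
        true_pos = sum(
            1 for k, p in prediction.items() if p == c and response.get(k) == c
        )
        predicted = sum(1 for p in prediction.values() if p == c)
        actual = sum(1 for a in response.values() if a == c)
        if true_pos == 0 or predicted == 0 or actual == 0:
            continue  # undefined score: leave the class out instead of reporting 0
        precision = true_pos / predicted
        recall = true_pos / actual
        f1 = 2 * precision * recall / (precision + recall)
        scores[c] = (precision, recall, f1)
    return scores


# Hypothetical class IDs keyed by row; made up for illustration.
response_test = {1: 1, 2: 2, 3: 3, 4: 1, 5: 2}
prediction_test = {1: 1, 2: 2, 3: 1, 4: 1, 5: 3}

scores = per_class_scores(prediction_test, response_test)
```

In this toy data, class 1 has precision 2/3 and recall 1, class 2 has precision 1 and recall 1/2, and class 3 has no correct predictions, so it does not appear in `scores`.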

Finally, you can compute the full confusion matrix — where actual is the actual class, or response, and predicted is the predicted class:

// install

def confusion_matrix[predicted, actual] = count[
    x : response_test(x, actual) and prediction_test(x, predicted)
]

When you query for it, you get:

// query

confusion_matrix

Note that count does not return 0 for an empty relation, which means that if no data record of class actual was predicted to be of class predicted, this pair does not appear in confusion_matrix. This reflects the fundamental principle that, in Rel, missing data, or NULL in SQL, are not explicitly stored or represented.

To assign a zero count to these missing values, you simply need to explicitly define that for any missing predicted-actual class pair, (predicted, actual), you want to assign a count of 0. This is done below with the left_override (<++) operator:

// query

table[
    confusion_matrix[class_column.class_id, class_row.class_id] <++ 0
    for class_column in classes,
    class_row in classes
]

The query above also labels the rows and columns with the original class names rather than the integer class IDs, and uses table to display the relation as a wide table.
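The difference between the sparse confusion matrix and its zero-filled version can be sketched in plain Python (not Rel). The helpers and the class IDs are hypothetical. The dictionary lookup with a default of 0 plays a role comparable to the override with 0 in the query above.

```python
from collections import Counter


def confusion_matrix(prediction, response):
    """Count (predicted, actual) pairs; pairs that never occur are simply absent."""
    return Counter(
        (prediction[k], actual) for k, actual in response.items() if k in prediction
    )


def dense_matrix(sparse, classes):
    """Assign a count of 0 to every missing (predicted, actual) pair."""
    return {(p, a): sparse.get((p, a), 0) for p in classes for a in classes}


# Hypothetical class IDs keyed by row; made up for illustration.
response_test = {1: 1, 2: 2, 3: 3, 4: 1, 5: 2}
prediction_test = {1: 1, 2: 2, 3: 1, 4: 1, 5: 3}

sparse = confusion_matrix(prediction_test, response_test)
dense = dense_matrix(sparse, classes=[1, 2, 3])
```

Here `sparse` holds only the four pairs that occur, while `dense` holds all nine pairs, with 0 for the five that never occur.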

## Training Multiple Classifiers

With Rel, you can easily train and test multiple classifiers. Consider the following example.

You will train a set of classifiers on the same train and test datasets as before, but you will use a different set of hyper-parameters for each classifier. You will use a relation called hyper_param within a module called fine_tune to keep all the different hyper-parameter configurations:

// install

module fine_tune
    def hyper_param = {
        ("Classifier 1", {("minimum_leaf_size", "10"); ("minimum_gain_split", "1e-07")});
        ("Classifier 2", {("minimum_leaf_size", "20"); ("maximum_depth", "3")});
        ("Classifier 3", {("minimum_leaf_size", "5"); ("maximum_depth", "0")});
    }
end

In the hyper_param relation, the first element is a key (here, the strings "Classifier 1", "Classifier 2", and "Classifier 3") that identifies each hyper-parameter configuration. The same key also identifies the classifier trained from each configuration. You can now train multiple classifiers as follows:

// install

module fine_tune
    def classifier[i] = mlpack_decision_tree[
        feature_train,
        response_train,
        hyper_param[i]
    ]
end

Note that the call to mlpack_decision_tree is the same as before, except that it now iterates over all the hyper-parameter configurations in hyper_param.

You can now create predictions for each of the trained classifiers on the test set:

// install

module fine_tune
    def prediction_test[i] = mlpack_decision_tree_predict[
        fine_tune:classifier[i],
        feature_test,
        1
    ]
end

And, as the next step, you can compute the precision for each classifier:

// install

module fine_tune
    def score_precision(i, cl, score) =
        c = count[ pos :
            fine_tune:prediction_test(i, pos, id) and
            response_test(pos, id)
        ]
        and n = count[pos : fine_tune:prediction_test(i, pos, id)]
        and score = c/n
        and id = class_id[cl]
        from c, n, id
end

// query

def output = table[fine_tune:score_precision]

Finally, you can plot the performance of each classifier over some specific metric. For example, you can show the precision of each classifier for each of the three classes as follows:

// query

def precision_plot_data[:[], i] = {
    (:classifier_id, cid);
    (:class, cl);
    (:precision, pr)
}
from cid, cl, pr where sort[fine_tune:score_precision](i, cid, cl, pr)

def chart:data:values = precision_plot_data
def chart:mark = "bar"
def chart:width = 300
def chart = vegalite_utils:x[{
    (:field, "class");
}]

def chart = vegalite_utils:y[{
    (:field, "precision");
    (:type, "quantitative");
    (:axis, :format, ".3f");
}]

def chart:encoding:xOffset = { (:field, "classifier_id"); (:type, "nominal");}
def chart:encoding:color:field = "classifier_id"

def output = vegalite:plot[chart]

Based on this analysis of multiple classifiers, you can determine which classifier is expected to perform best. As an example, pick the classifier with the maximum overall precision on the test set, that is, the fraction of test predictions that match the actual class across all classes:

// query

def score_precision_overall[i] =
    count[pos : fine_tune:prediction_test[i, pos] = response_test[pos]] /
    count[fine_tune:prediction_test[i]]

def max_precision_classifier_id = argmax[score_precision_overall]
def max_precision = score_precision_overall[max_precision_classifier_id]

def output:classifier = max_precision_classifier_id
def output:precision = max_precision
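The selection step can be illustrated with a plain Python sketch (not Rel). The `best_by_score` helper and the scores are hypothetical and made up for illustration. This guide does not cover how argmax reports ties, so the sketch makes its own choice and returns every key that reaches the maximum.

```python
def best_by_score(scores):
    """Return (set of keys with the highest score, that score)."""
    best = max(scores.values())
    return {key for key, value in scores.items() if value == best}, best


# Hypothetical overall scores per classifier; made up for illustration.
overall = {"Classifier 1": 0.94, "Classifier 2": 0.97, "Classifier 3": 0.97}

best_ids, best_score = best_by_score(overall)
```

Returning a set keeps ties visible, which matters when several hyper-parameter configurations perform equally well on a small test set.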

## Summary

This guide has demonstrated the use of a decision tree classifier on the penguin dataset. More specifically, this guide used mlpack_decision_tree, i.e., a decision tree classifier from mlpack. You can use additional classifiers in a similar way. For example:

In addition to mlpack, other machine learning libraries are also supported, such as glm or xgboost, and there are more coming.

It is important to note that all supported machine learning models are designed to share the same API, so you can easily swap machine learning models of a similar type, such as classification models. In the example in this guide, you can switch mlpack_decision_tree with mlpack_random_forest and either change hyper_param to the right parameters for mlpack_random_forest or leave it empty to use the defaults. You then have a random forest classifier.

In addition to the machine learning models, the Machine Learning Library has useful functionality for other tasks. For example, you can perform k-nearest-neighbor search on a relation through mlpack_knn or perform dimensionality reduction through kernel principal component analysis (KPCA) in a given dataset through mlpack_kernel_pca.