Machine Learning: Classification
This how-to guide demonstrates how to load a dataset, build a classification model, and perform predictions using that model.
Goal
This how-to guide provides an introduction to Rel’s machine learning functionality. As one part of a larger series of machine learning how-to guides, this guide focuses on classification. Specifically, this guide explores how to load a dataset, build a classification model, and perform predictions using that model.
Preliminaries
It’s helpful to read through the CSV Import Guide and JSON Import and Export Guide. These guides contain examples that will show you how to load different kinds of data into the system.
Dataset
This how-to guide uses the Palmer Archipelago (Antarctica) penguin data. You will use a copy of the penguin dataset located in RelationalAI’s public S3 bucket.
This is a multivariate dataset with instances of penguins together with their features.
You will use the penguins_size.csv file.
The dataset contains 344 instances of penguins from three species (classes): Adelie, Chinstrap, and Gentoo.
The Adelie species contains 152 instances of penguins, Chinstrap has 68, and Gentoo has 124.
For each instance within the dataset, in addition to the species, there are six features:
Feature | Description | Type |
---|---|---|
island | The name of the island (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica) where the penguin was found and measured. | Categorical |
culmen_length_mm | The length of the penguin’s culmen in millimeters. | Numerical |
culmen_depth_mm | The depth of the penguin’s culmen in millimeters. | Numerical |
flipper_length_mm | The length of the penguin’s flippers in millimeters. | Numerical |
body_mass_g | The body mass of the penguin in grams. | Numerical |
sex | The sex (MALE, FEMALE) of the penguin. | Categorical |
Your goal is to build a classifier to predict the species of the penguin, given its features.
Here are the header and the first five data rows of the penguins_size.csv file:
species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
...
As you can see, the data are not available for certain penguins. Missing values are denoted by NA in the example above.
To address this, you will clean the loaded data, as discussed in the Cleaning the Data section.
Loading the Data
You will begin building a classifier by loading the data from the file containing the penguin data.
You can load the file using load_csv as follows:
// update
def config[:path] = "s3://relationalai-documentation-public/ml-classification/penguin/penguins_size.csv"
def config[:schema, :species] = "string"
def config[:schema, :island] = "string"
def config[:schema, :culmen_length_mm] = "float"
def config[:schema, :culmen_depth_mm] = "float"
def config[:schema, :flipper_length_mm] = "float"
def config[:schema, :body_mass_g] = "float"
def config[:schema, :sex] = "string"
// insert transaction
def insert[:penguin_raw] = lined_csv[load_csv[config]]
Note that the code above specifies the path to the file, which is located in RelationalAI’s public AWS S3 bucket.
The s3:// URL indicates a path to an AWS S3 bucket, which in this case is public.
The code reads the header from the file and will use the header names as feature names.
You have specified the schema of the imported file.
Specifically, the first two features and the last one (species, island, sex) are of type string, while the remaining ones (culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g) are of type float.
In this guide, you will learn to predict the species feature.
Cleaning the Data
As discussed in the previous section, there are certain instances, or lines, in the dataset that you need to clean up.
One such instance was shown earlier, where all the values were set to NA.
As a first step, Rel has already filtered out these values for you.
Since it wasn’t able to parse NA as a float, these instances were stored as load_errors inside the penguin_raw relation:
// query
penguin_raw:load_errors
As you can see from the file positions, there were two such lines with all of their features set to ‘NA’ in the dataset.
In addition to those errors, there are also a few lines where sex is defined as NA (eight in total), and one line where sex is defined as "." (a single period).
For the purpose of this guide, you will drop all rows with an issue. You can get a clean dataset as follows:
// install
def row_with_error(row) =
    penguin_raw:sex(row, "NA") or
    penguin_raw:sex(row, ".") or
    penguin_raw:load_errors(row, _, _)
// update
def insert[:penguin] = column, row, entry... :
    penguin_raw(column, row, entry...) and not
    row_with_error(row)
The final penguin dataset looks as follows:
// query
table[penguin]
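The cleaning rule above can also be sketched outside Rel. The following plain-Python example is only an illustration, not part of the Rel library: it applies the same three conditions to the sample rows shown earlier. The helper names row_with_error and clean are hypothetical, and a failed float conversion stands in for a Rel load error.

```python
import csv
import io

# Header and first five data rows of penguins_size.csv, as shown earlier in this guide.
SAMPLE = """\
species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
"""

# Columns declared as "float" in the load configuration.
FLOAT_COLUMNS = (
    "culmen_length_mm",
    "culmen_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
)


def row_with_error(row):
    """Return True if the row should be dropped (hypothetical helper)."""
    if row["sex"] in ("NA", "."):
        return True
    for column in FLOAT_COLUMNS:
        try:
            float(row[column])
        except ValueError:
            # Assumption: an unparsable float plays the role of a load error.
            return True
    return False


def clean(text):
    """Parse CSV text and keep only the rows without issues."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if not row_with_error(row)]


penguin = clean(SAMPLE)
print(len(penguin))  # number of rows that survive the cleaning
```

Of the five sample rows, only the row with all values set to NA is dropped.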
Analyzing the Data
You can easily visualize the data you just loaded in different ways. For example, take a look at the distribution of male and female penguins by species:
// query
def output = vegalite:plot[
    vegalite:bar[
        :species,
        { :aggregate, "count" },
        { :data, penguin; :color, :sex; }
    ]
]
Preparing the Data
Once you have the data loaded, you need to transform the data in order to feed them into the machine learning models.
In general, you can use a variety of machine learning models. For the complete list of supported models, see the Machine Learning Library.
Most of these models require two relations:
- One containing the features to be used as inputs to train a model.
- One containing the response (or target) variable (or class in this case) that you want to learn to predict.
To this end, you can put the feature data in the features relation and the class data, which are currently read as strings, in the response_string relation.
Note that in the current implementation of the Machine Learning Library, the relation from which you extract the features (i.e., penguin) needs to be a base relation.
This was done earlier using insert when you defined the penguin relation.
// install
def features = penguin[col]
    for col in {
        :island; :culmen_length_mm; :culmen_depth_mm;
        :flipper_length_mm; :body_mass_g; :sex
    }
def response_string = penguin:species
You can easily get statistics about your features data using describe:
// query
table[describe[features]]
You can also do the same for your response_string data:
// query
table[(:response, describe_full[response_string])]
Here, describe_full is used because there is only one column in the response_string relation.
Unlike describe, describe_full provides statistics for the overall set of data rather than per feature.
Converting Class Names to Integers
You will use an mlpack classifier, so you need to represent the response classes as integers.
You cannot use strings or floats to represent the classes.
To this end, you will first identify all the unique classes. You can get them using last:
// install
def classes = last[response_string]
Next, you assign a numeric ID to each class. You can do this using sort, which sorts the classes, and then use the ordering index as the class ID:
// install
def id_class = sort[classes]
// query
id_class
In order to join with the relation response_string and get the IDs, you need to swap the first and second columns.
You can do this using transpose:
// install
def class_id = transpose[id_class]
Note that transpose simply swaps the first and second columns and is not to be confused with typical matrix transposition.
After you swap the columns, you can join with the response_string relation:
// install
def response = response_string.class_id
You can also do all this in one step as follows:
def response = response_string.(transpose[sort[last[response_string]]])
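As a rough illustration of the same idea outside Rel, the following plain-Python sketch collects the distinct class names, sorts them, and uses the 1-based position as the class ID. The function names and the sample keys are hypothetical, and the 1-based numbering is an assumption about how the sort index is numbered.

```python
def build_class_ids(response_string):
    """Map each distinct class name to an integer ID in sorted order.

    Assumption: IDs start at 1, mirroring a 1-based sort index.
    """
    classes = sorted(set(response_string.values()))
    return {name: index for index, name in enumerate(classes, start=1)}


def encode_response(response_string):
    """Replace every class name with its integer ID, keeping the row keys."""
    class_id = build_class_ids(response_string)
    return {key: class_id[name] for key, name in response_string.items()}


# Hypothetical row keys (file positions) mapped to class names.
response_string = {2: "Adelie", 3: "Gentoo", 4: "Chinstrap", 5: "Adelie"}

print(build_class_ids(response_string))
print(encode_response(response_string))
```

Because the IDs follow the sorted order of the class names, the mapping is deterministic for a given set of classes.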
Creating Training and Test Datasets
In classification, as well as other machine learning approaches, Rel uses a training dataset to learn a classification model and a test dataset to determine the accuracy of your model. In certain cases, you may also use a validation dataset for parameter tuning, but only training and test are considered for the purposes of this how-to guide.
Because the penguin dataset is not already split into training and test sets, you will have to create these two datasets.
The following example splits the data into training and test sets with a ratio of 80/20.
You can specify the splitting ratio and the seed in split_params.
The splitting is done by mlpack_preprocess_split, which splits the keys into the two sets.
Afterwards, you can join them with the features and response relations to generate the corresponding training and test datasets:
// install
def split_params = {("test_ratio", "0.2"); ("seed", "42")}
def data_key(:keys, k) = features(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_params]
def feature_train(f, k, v) = features(f, k, v) and data_key_split(1, k)
def feature_test(f, k, v) = features(f, k, v) and data_key_split(2, k)
def response_train(k, v) = response(k, v) and data_key_split(1, k)
def response_test(k, v) = response(k, v) and data_key_split(2, k)
The relation split_params specifies the exact splitting ratio between training and test sets.
Note that both the parameter name and the value need to be encoded as strings.
At this point, you can also add various checks to ensure that the split into training and test sets includes all the instances from the original dataset. For example, you can check that the number of instances in training and test adds up:
// install
ic all_data() {
    count[feature_train] + count[feature_test] = count[features]
}
Or, you can more rigorously ensure that you have actually performed a split using all the available data:
// install
ic all_features() {
    equal(features, union[feature_train, feature_test])
}
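To make the splitting and the two checks concrete, here is a minimal plain-Python sketch. It is not how mlpack_preprocess_split works internally: the shuffling strategy, the function name split_keys, and the sample keys are assumptions made for illustration only.

```python
import random


def split_keys(keys, test_ratio=0.2, seed=42):
    """Split a set of row keys into (train, test) sets, reproducibly.

    Assumption: a seeded shuffle followed by a cut; the actual library
    may select the test rows differently.
    """
    shuffled = sorted(keys)  # fixed starting order so the seed fully decides the result
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return set(shuffled[n_test:]), set(shuffled[:n_test])


keys = set(range(1, 101))  # hypothetical row keys
train_keys, test_keys = split_keys(keys)

# The same two checks as the integrity constraints above:
assert len(train_keys) + len(test_keys) == len(keys)  # the sizes add up
assert train_keys | test_keys == keys                 # every key is used

print(len(train_keys), len(test_keys))
```

Fixing the seed makes the split repeatable, which is useful when comparing several models on the same data.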
Building a Classifier
This guide uses mlpack to create a decision tree classifier. The decision tree classifier of mlpack, as well as most of the other classifiers, can accept a set of optional parameters to tune the specific algorithm. The parameters for each classifier, otherwise known as hyper-parameters, are documented in the Machine Learning Library reference.
You can set the hyper-parameters through a relation, called hyper_param here, as follows:
// install
def hyper_param = {
    ("minimum_leaf_size", "10");
    ("minimum_gain_split", "1e-07")
}
Note that each classifier has its own parameters, which you can find in the Machine Learning Library reference.
Additionally, it is important to note that the parameters currently need to be passed as strings, as in the example above.
You can also pass no parameters to the classifier.
This example sets the minimum number of instances in a leaf to 10 and the minimum gain for node splitting to 1e-07.
At this point, you are ready to build your classifier. You will use mlpack_decision_tree and specify the features for learning (i.e., the feature_train relation), the classes to learn to predict (i.e., the response_train relation), and the parameters:
// install
def classifier = mlpack_decision_tree[
    feature_train,
    response_train,
    hyper_param
]
Now you have a trained classifier in the relation classifier, which represents the model you have learned.
Performing Predictions
The trained model classifier is now ready to make predictions.
To make predictions, you use mlpack_decision_tree_predict, where you need to provide:
- The trained ML model.
- A relation with features similar to the one used for training.
- A number that indicates the number of keys used in the feature relation.
The information about the number of keys is necessary because it defines the arity of the relation with the features used to perform the predictions. In this case, you have only one key: the CSV file position, carried over from the data loading step.
You can predict the penguin species using the training dataset:
// install
def prediction_train = mlpack_decision_tree_predict[
    classifier,
    feature_train,
    1
]
You can also predict the penguin species of the unseen test dataset:
// install
def prediction_test = mlpack_decision_tree_predict[
    classifier,
    feature_test,
    1
]
Here are some predictions for the test dataset:
// query
top[5, prediction_test]
Evaluating the Model
You can evaluate machine learning models using a variety of metrics. One popular metric is accuracy, defined as the number of correct predictions divided by the total number of predictions.
You can compute the accuracy of the classifier model on the training dataset as follows:
// install
def train_accuracy =
    count[pos : prediction_train[pos] = response_train[pos]] /
    count[response_train]
// query
train_accuracy
What matters here is the performance of your model on the test dataset:
// install
def test_accuracy =
    count[pos : prediction_test[pos] = response_test[pos]] /
    count[response_test]
// query
test_accuracy
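The accuracy definition can be checked by hand on a tiny example. The following plain-Python sketch mirrors the computation above; the function name accuracy and the sample keys and class IDs are hypothetical.

```python
def accuracy(prediction, response):
    """Fraction of rows whose predicted class equals the actual class."""
    correct = sum(
        1 for key, actual in response.items() if prediction.get(key) == actual
    )
    return correct / len(response)


# Hypothetical row keys mapped to integer class IDs.
response_test = {10: 1, 11: 2, 12: 3, 13: 1}
prediction_test = {10: 1, 11: 2, 12: 1, 13: 1}

print(accuracy(prediction_test, response_test))  # 3 correct out of 4 -> 0.75
```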
You can also compute precision and recall (otherwise known as sensitivity) metrics for each class:
// install
def score_precision[c] =
    count[pos : prediction_test(pos, c) and response_test(pos, c)] /
    count[pos : prediction_test(pos, c)]
def score_recall[c] =
    count[pos : prediction_test(pos, c) and response_test(pos, c)] /
    count[pos : response_test(pos, c)]
You can also query them:
// query
score_precision
// query
score_recall
With precision and recall metrics at hand, you can also compute the F1 score for each class:
// install
def score_f1[c] =
    2 * score_precision[c] * score_recall[c] /
    (score_precision[c] + score_recall[c])
You can then query them:
// query
score_f1
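As a worked example of these three per-class metrics, here is a small plain-Python sketch. The function name class_scores and the sample data are hypothetical. Returning None for a class with no correct predictions is a design choice of this sketch, made to echo how a count over an empty relation yields no result.

```python
def class_scores(prediction, response, cls):
    """Return (precision, recall, f1) for one class, or None if undefined."""
    true_positive = sum(
        1
        for key, predicted in prediction.items()
        if predicted == cls and response.get(key) == cls
    )
    predicted_count = sum(1 for p in prediction.values() if p == cls)
    actual_count = sum(1 for a in response.values() if a == cls)
    if true_positive == 0:
        # No correct prediction for this class: F1 would divide by zero,
        # so this sketch reports nothing for the class.
        return None
    precision = true_positive / predicted_count
    recall = true_positive / actual_count
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical row keys mapped to integer class IDs.
response_test = {10: 1, 11: 2, 12: 3, 13: 1}
prediction_test = {10: 1, 11: 2, 12: 1, 13: 1}

for cls in (1, 2, 3):
    print(cls, class_scores(prediction_test, response_test, cls))
```

For class 1, two of the three rows predicted as class 1 are correct (precision 2/3) and both actual class 1 rows are found (recall 1), which gives an F1 score of 0.8.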
Finally, you can compute the full confusion matrix, where actual is the actual class, or response, and predicted is the predicted class:
// install
def confusion_matrix[predicted, actual] = count[
    x : response_test(x, actual) and prediction_test(x, predicted)
]
When you query for it, you get:
// query
confusion_matrix
Note that count does not return 0 for an empty relation, which means that if no data record of class actual was predicted to be of class predicted, this pair does not appear in confusion_matrix.
This reflects the fundamental principle that, in Rel, missing data (NULL in SQL) are not explicitly stored or represented.
To assign a zero count to these missing values, you need to explicitly state that any missing predicted-actual class pair, (predicted, actual), gets a count of 0.
This is done below with the left_override (<++) operator:
// query
table[
    confusion_matrix[class_column.class_id, class_row.class_id] <++ 0
    for class_column in classes,
    class_row in classes
]
Here, the query also labels the rows and columns with the original class names instead of the integer class IDs, and uses table to display the relation as a wide table.
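The difference between the sparse and the zero-filled confusion matrix can be illustrated with a short plain-Python sketch. The function names and the sample data are hypothetical, and the dictionary lookup with a default of 0 only approximates the role that the override operator plays in the Rel query.

```python
from collections import Counter


def sparse_confusion(prediction, response):
    """Count only the (predicted, actual) pairs that actually occur."""
    return Counter(
        (prediction[key], actual)
        for key, actual in response.items()
        if key in prediction
    )


def dense_confusion(prediction, response, classes):
    """Fill in an explicit 0 for every pair that never occurs."""
    counts = sparse_confusion(prediction, response)
    return {
        (predicted, actual): counts.get((predicted, actual), 0)
        for predicted in classes
        for actual in classes
    }


# Hypothetical row keys mapped to integer class IDs.
response_test = {10: 1, 11: 2, 12: 3, 13: 1}
prediction_test = {10: 1, 11: 2, 12: 1, 13: 1}

sparse = sparse_confusion(prediction_test, response_test)
dense = dense_confusion(prediction_test, response_test, classes=[1, 2, 3])

print(len(sparse), len(dense))  # occurring pairs vs. all class pairs
```

With three classes, the dense version always has nine entries, while the sparse version contains only the pairs that occur in the data.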
Training Multiple Classifiers
With Rel, you can easily train and test multiple classifiers. Consider the following example.
You will train a set of classifiers on the same train and test datasets as before, but you will use a different set of hyper-parameters for each classifier.
You will use a relation called hyper_param within a module called fine_tune to keep all the different hyper-parameter configurations:
// install
module fine_tune
    def hyper_param = {
        ("Classifier 1", {("minimum_leaf_size", "10"); ("minimum_gain_split", "1e-07")});
        ("Classifier 2", {("minimum_leaf_size", "20"); ("maximum_depth", "3")});
        ("Classifier 3", {("minimum_leaf_size", "5"); ("maximum_depth", "0")});
    }
end
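In the hyper_param relation, each hyper-parameter configuration is identified by a key. The example above uses string names ("Classifier 1", "Classifier 2", ...), but you can also use an integer key (e.g., 1, 2, 3, ...).
This key is also useful for identifying the classifier trained from each configuration.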
You can now train multiple classifiers easily as follows:
// install
module fine_tune
    def classifier[i] = mlpack_decision_tree[
        feature_train,
        response_train,
        hyper_param[i]
    ]
end
Note that the call to mlpack_decision_tree is the same as before, except that you are now iterating over all the hyper-parameter configurations in hyper_param.
You can now create predictions for each of the trained classifiers on the test set:
// install
module fine_tune
    def prediction_test[i] = mlpack_decision_tree_predict[
        fine_tune:classifier[i],
        feature_test,
        1
    ]
end
And, as the next step, you can compute the precision for each classifier:
// install
module fine_tune
    def score_precision(i, cl, score) =
        c = count[ pos :
            fine_tune:prediction_test(i, pos, id) and
            response_test(pos, id)
        ]
        and n = count[pos : fine_tune:prediction_test(i, pos, id)]
        and score = c/n
        and id = class_id[cl]
        from c, n, id
end
// query
def output = table[fine_tune:score_precision]
Finally, you can plot the performance of each classifier over some specific metric. For example, you can show the precision of each classifier for each of the three classes as follows:
// query
def precision_plot_data[:[], i] = {
    (:classifier_id, cid);
    (:class, cl);
    (:precision, pr)
}
from cid, cl, pr where sort[fine_tune:score_precision](i, cid, cl, pr)
def chart:data:values = precision_plot_data
def chart:mark = "bar"
def chart:width = 300
def chart = vegalite_utils:x[{
    (:field, "class");
}]
def chart = vegalite_utils:y[{
    (:field, "precision");
    (:type, "quantitative");
    (:axis, :format, ".3f");
}]
def chart:encoding:xOffset = { (:field, "classifier_id"); (:type, "nominal");}
def chart:encoding:color:field = "classifier_id"
def output = vegalite:plot[chart]
Based on the performance of multiple classifiers, Rel allows you to easily determine which classifier is expected to perform best. As an example, pick the classifier with the maximum precision on the test set over all classes, computed here as the fraction of correct predictions among all predictions:
// query
def score_precision_overall[i] =
    count[pos : fine_tune:prediction_test[i, pos] = response_test[pos]] /
    count[fine_tune:prediction_test[i]]
def max_precision_classifier_id = argmax[score_precision_overall]
def max_precision = score_precision_overall[max_precision_classifier_id]
def output:classifier = max_precision_classifier_id
def output:precision = max_precision
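The selection step can be sketched in plain Python as follows. The function names and the sample predictions are hypothetical. Note one difference that this sketch does not reproduce: Python's max returns a single key even when several classifiers tie for the best score.

```python
def overall_score(prediction, response):
    """Fraction of predictions that match the actual class."""
    correct = sum(
        1 for key, predicted in prediction.items() if response.get(key) == predicted
    )
    return correct / len(prediction)


def best_classifier(predictions, response):
    """Return (classifier_key, score) for the highest overall score."""
    scores = {
        name: overall_score(prediction, response)
        for name, prediction in predictions.items()
    }
    best = max(scores, key=scores.get)  # on ties, the first maximum wins
    return best, scores[best]


# Hypothetical test responses and per-classifier predictions.
response_test = {10: 1, 11: 2, 12: 3, 13: 1}
predictions = {
    "Classifier 1": {10: 1, 11: 2, 12: 1, 13: 1},
    "Classifier 2": {10: 1, 11: 2, 12: 3, 13: 1},
    "Classifier 3": {10: 2, 11: 2, 12: 3, 13: 3},
}

print(best_classifier(predictions, response_test))
```

In this made-up example, the second configuration predicts every row correctly and is therefore selected.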
Summary
This guide has demonstrated the use of a decision tree classifier on the penguin dataset.
More specifically, this guide used mlpack_decision_tree, i.e., a decision tree classifier from mlpack.
You can use additional classifiers in a similar way.
In addition to mlpack, other machine learning libraries are also supported, such as glm or xgboost, and there are more coming.
It is important to note here that all supported machine learning models are specifically designed to have the same API.
In this way, you can easily swap machine learning models of similar type, i.e., classification models.
In the example in this guide, you can simply replace mlpack_decision_tree with mlpack_random_forest, change hyper_param to the right parameters for mlpack_random_forest (or just leave it empty to use the defaults), and you now have a random forest classifier.
See Also
In addition to the machine learning models, the Machine Learning Library has useful functionality for other tasks.
For example, you can perform k-nearest-neighbor search on a relation through mlpack_knn or perform dimensionality reduction on a given dataset through kernel principal component analysis (KPCA) with mlpack_kernel_pca.
For a complete list of machine learning models and related functionality, see the Machine Learning Library.