Machine Learning (Regression)

This how-to guide demonstrates how to load a dataset, build regression models using glm and mlpack bindings, and perform predictions using these models.

Goal

The goal of this how-to guide is to provide an introduction to Rel’s machine learning functionality. Specifically, this how-to guide will focus on linear regression and demonstrate how to achieve that using the glm and mlpack bindings.

Preliminaries

We recommend that you first go through the CSV Import and JSON Import and Export guides, since they contain examples and functionality that are useful for understanding how to load different kinds of data into the system. You may also find the Machine Learning (Classification) guide helpful.

Dataset

For this how-to guide we will be using the Airfoil Self-Noise Data Set. This is a dataset from NASA with samples from acoustic and aerodynamic tests of two- and three-dimensional airfoil blade sections. The tests were conducted in a specialized anechoic wind tunnel at varying wind tunnel speeds and angles of attack.

The dataset contains 1503 instances of measurements. Specifically, the columns (i.e., features) in the file, which are all numeric, are the following:

  1. The frequency, in Hertz.
  2. The angle of attack, in degrees.
  3. The chord length, in meters.
  4. The free-stream velocity, in meters/sec.
  5. The suction side displacement thickness, in meters.
  6. The scaled sound pressure level, in decibels.

Here is a sample of the first 5 lines of the airfoil dataset that we will be working with:

800     0	0.3048	71.3	0.00266337	126.201
1000	0	0.3048	71.3	0.00266337	125.201
1250	0	0.3048	71.3	0.00266337	125.951
1600	0	0.3048	71.3	0.00266337	127.591
2000	0	0.3048	71.3	0.00266337	127.461
...

For the purpose of this guide, we will use the first 5 columns as input features and attempt to predict the last column, i.e., the scaled sound pressure level. Our goal is to build a linear regression model that predicts this sound level.

Loading the Data

We start building a linear regression model by loading the data. We can do this using lined_csv as follows:

update
def config[:path] = "s3://relationalai-documentation-public/ml-regression/airfoil/airfoil_self_noise.dat"

def config[:syntax, :header_row] = -1
def config[:syntax, :header] =
    (1, :frequency);
    (2, :angle);
    (3, :chord_length);
    (4, :velocity);
    (5, :displacement);
    (6, :sound)

def config[:syntax, :delim] = '\t'

def config[:schema, :frequency] = "float"
def config[:schema, :angle] = "float"
def config[:schema, :chord_length] = "float"
def config[:schema, :velocity] = "float"
def config[:schema, :displacement] = "float"
def config[:schema, :sound] = "float"

// insert transaction
def insert[:airfoil] = lined_csv[load_csv[config]]

Please note that, in the code above, we have specified the file location using an AWS S3 URL. This location must be accessible to the server that we are using.

Since the airfoil file from the UCI Machine Learning Repository has no header, we specified (through :header_row = -1) that the file contains no header row. We then defined our own header (using :header) and also specified the schema of the imported file, indicating that all attributes are of type float. The row IDs loaded into the airfoil relation will be useful later on, when we need to split our dataset into training and test sets.

Preparing the Data

Once we have the data loaded, we need to transform it so that it can be fed into the machine learning models.

In general, we support a variety of machine learning models. The complete list of supported models can be found in The Machine Learning Library.

Most of these models require two relations:

  • a relation containing the features to be used as inputs to train a model, and
  • a relation containing the response (or target) variable that we want to learn to predict.

To this end, we put the feature data in the features relation and the response data in the responses relation.

install
def features = airfoil[col]
    for col in {:frequency; :angle; :chord_length; :velocity; :displacement}

def responses = airfoil:sound

We can easily get statistics about our feature data using describe:

query
table[describe[features]]

Relation: output

              angle              chord_length         displacement          frequency           velocity
count         1503               1503                 1503                  1503                1503
max           22.2               0.3048               0.0584113             20000.0             71.3
mean          6.782302062541517  0.13654823685961226  0.011139880391217556  2886.3805721889553  50.860745176314175
min           0.0                0.0254               0.000400682           200.0               31.7
percentile25  2.0                0.0508               0.00253511            800.0               39.6
percentile50  5.4                0.1016               0.00495741            1600.0              39.6
percentile75  9.9                0.2286               0.0155759             4000.0              71.3
std           5.918128124886445  0.09354072837396651  0.013150234266814782  3152.573136930677   15.572784395385765

and, of course, we can do the same for our responses data:

query
describe_full[responses]

Relation: output

:count         1503
:max           140.987
:mean          124.83594278110434
:min           103.38
:percentile25  120.191
:percentile50  125.721
:percentile75  129.9955
:std           6.898656621628728

Here, we used describe_full, since the responses relation has only one column, in order to get a more detailed description of the data.
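
In the same way, we could get detailed statistics for a single input feature by applying describe_full to just that column. As a sketch (using the features relation defined above), for the frequency feature this could look like:

query
describe_full[features:frequency]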

Creating Train and Test Datasets

In our approach, we will use a “train” dataset to learn a linear regression model and a “test” dataset to determine the accuracy of our model. In certain cases, we may also use a validation dataset for parameter tuning, but we will consider only train and test for the purposes of this guide.

Since the airfoil dataset does not come already split into training and test sets, we will have to create these two datasets ourselves.

In the following, we split our data into training and test sets with an 80/20 ratio. We specify the splitting ratio and the seed in split_params. The splitting is done by mlpack_preprocess_split, which splits the keys into the two sets. Afterwards, we join them with the features and responses to generate the corresponding training and test datasets:

install
def split_params = {("test_ratio", "0.2"); ("seed", "42")}

def data_key(:keys, k) = features(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_params]

def features_train(f, k, v) = features(f, k, v) and data_key_split(1, k)
def features_test(f, k, v) = features(f, k, v) and data_key_split(2, k)

def responses_train(k, v) = responses(k, v) and data_key_split(1, k)
def responses_test(k, v) = responses(k, v) and data_key_split(2, k)

The relation split_params specifies the exact splitting ratio between the training and test sets. Note that both the parameter name and the value need to be encoded as strings.
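
For example, a 70/30 split with a different seed could be encoded as follows (my_split_params is a hypothetical name, used here only to avoid clashing with the split_params relation defined above):

install
// hypothetical 70/30 split with a different random seed
def my_split_params = {("test_ratio", "0.3"); ("seed", "123")}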

At this point, we can also add various checks to ensure that we have included all the instances from the original data set when we did the splitting in training and test. For example, we can check that the number of instances in training and test add up:

install
ic all_data() {
    count[features_train] + count[features_test] = count[features]
}

Or, we can more rigorously ensure that we have actually performed a split using all the available data:

install
ic all_features() {
    equal(features, union[features_train, features_test])
}

Building a Linear Regression Model

We will mostly be using the glm bindings to create a linear regression model. To this end, we will use glm_linear_regression from The Machine Learning Library (ml), which takes as input two relations: the features that we want to learn from and the response. We can train a glm linear regression model as follows:

install
def glm_lr = glm_linear_regression[features_train, responses_train]

Please note that glm provides only unregularized linear regression. In this guide, we will also discuss a regularized model using mlpack’s linear regression functionality, which supports passing a lambda parameter (for ridge regression), as follows:

install
def hyper_params = {("lambda", "0.1")}

def mlpack_lr = mlpack_linear_regression[features_train, responses_train, hyper_params]

Performing Predictions

Once we have our model ready, we can use it to perform predictions. We will use glm_predict for our glm model, and mlpack_linear_regression_predict for our mlpack model. In both cases, we will have to provide:

  1. the trained ML model,
  2. a relation with features similar to the one that was used for training, and
  3. the number of keys used in the feature relation.

The information about the number of keys is necessary so that glm_predict and mlpack_linear_regression_predict know how many keys are present in features_train and features_test. In our case, we have only one key, i.e., the CSV row number, which we carried over from the data loading step.

We can now predict the sound (i.e. the response variable) using the training dataset:

install
def glm_predictions_train = glm_predict[glm_lr, features_train, 1]

def mlpack_predictions_train = mlpack_linear_regression_predict[mlpack_lr, features_train, 1]

and we can, of course, also predict the sound for the unseen test dataset:

install
def glm_predictions_test = glm_predict[glm_lr, features_test, 1]

def mlpack_predictions_test = mlpack_linear_regression_predict[mlpack_lr, features_test, 1]
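
As a quick sanity check, we can also verify that we obtained exactly one prediction per test instance. The following sketch follows the same integrity-constraint pattern used earlier (the constraint name all_predictions is our own choice):

install
ic all_predictions() {
    count[glm_predictions_test] = count[responses_test]
}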

Let’s look at some predictions for the test dataset:

query
top[5, glm_predictions_test]

Relation: output

1  5   125.74603
2  15  126.22035
3  16  126.05504
4  34  124.76245
5  36  124.38098

query
top[5, mlpack_predictions_test]

Relation: output

1  5   125.90198989870775
2  15  126.28385289436524
3  16  126.11886499614315
4  34  124.74504272226545
5  36  124.364301418676

Evaluating Our Model

In order to evaluate a linear model, one of the metrics that we can use is the $R^2$ value (or, the square of the correlation coefficient $R$). The $R^2$ typically takes values between 0 and 1 and attempts to measure the overall fit of a linear model by measuring the proportion of the variance in the observed data that is explained by the model. Higher $R^2$ values are better because they indicate that more of the variance is explained by the model.

The $R^2$ is defined as follows:

$$\textrm{R}^2 = 1 - \frac{\sum_i (y_i - \hat{y_i})^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the expected value, $\hat{y_i}$ is the predicted value, and $\bar{y}$ is the mean of the expected values.

In addition to $R^2$, we can also use the Mean Square Error (MSE) or the Root Mean Square Error (RMSE), which attempt to capture the actual deviation of the predictions from the expected values:

$$\textrm{MSE} = \frac{1}{N} \sum_{i}^{N} (y_i - \hat{y_i})^2$$

$$\textrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i}^{N} (y_i - \hat{y_i})^2}$$

And we can also use the Mean Absolute Error (MAE), which is a more direct representation of the deviation of the predictions from the expected values:

$$\textrm{MAE} = \frac{1}{N} \sum_{i}^{N} |y_i - \hat{y_i}|$$
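
As a quick illustrative example (with made-up numbers, not values from the airfoil dataset): suppose the expected values are $y = (1, 2, 3)$ and the predictions are $\hat{y} = (1.5, 2, 2.5)$. Then $\bar{y} = 2$, the residual sum of squares is $0.25 + 0 + 0.25 = 0.5$, and the total sum of squares is $1 + 0 + 1 = 2$, so $R^2 = 1 - 0.5/2 = 0.75$. Similarly, $\textrm{MSE} = 0.5/3 \approx 0.167$, $\textrm{RMSE} \approx 0.408$, and $\textrm{MAE} = 1/3 \approx 0.333$.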

We now compute each of these metrics for the testing dataset and for our two models. We first provide the definitions for the different metrics:

install
// R2
@inline def R2[R, P] =
    1 - (sum[(R[pos] - P[pos])^2 for pos] /
         sum[(R[pos] - mean[R])^2 for pos])

// MSE, RMSE
@inline def MSE[R, P] = sum[(R[pos] - P[pos])^2 for pos] / count[R]
@inline def RMSE[R, P] = sqrt[MSE[R, P]]

// MAE
@inline def MAE[R, P] = sum[abs[R[pos] - P[pos]] for pos] / count[R]

And next, we compute them for the different models:

query
def results:R2:glm = R2[responses_test, glm_predictions_test]
def results:R2:mlpack = R2[responses_test, mlpack_predictions_test]

def results:MSE:glm = MSE[responses_test, glm_predictions_test]
def results:MSE:mlpack = MSE[responses_test, mlpack_predictions_test]

def results:RMSE:glm = RMSE[responses_test, glm_predictions_test]
def results:RMSE:mlpack = RMSE[responses_test, mlpack_predictions_test]

def results:MAE:glm = MAE[responses_test, glm_predictions_test]
def results:MAE:mlpack = MAE[responses_test, mlpack_predictions_test]

def output = table[results]

Relation: output

        MAE                MSE                 R2                  RMSE
glm     3.800692466666666  24.28166874989667   0.5276345560637989  4.927643326164818
mlpack  3.847861982779916  24.156418613430507  0.5249129020938041  4.914917966093687

Summary

We demonstrated the use of a linear regression model on the airfoil dataset. More specifically, we used glm_linear_regression and mlpack_linear_regression. In addition to mlpack and glm, we also support xgboost (with more coming).

It is important to note here that all of our machine learning models are specifically designed to have the same API. In this way, we can easily swap machine learning models (of a similar type, i.e., linear regression models). In our example, we used both glm_linear_regression and mlpack_linear_regression and, in this way, we easily trained two different linear models (one unregularized and one regularized).

In addition to the machine learning models, The Machine Learning Library (ml) has useful functionality for other tasks as well. For example, we can perform k-nearest-neighbor search on a relation through mlpack_knn, or perform dimensionality reduction on a given dataset through kernel principal component analysis (KPCA) using mlpack_kernel_pca.

For a complete list of machine learning models and related functionality please see The Machine Learning Library.