Machine Learning: Regression

This How-To Guide demonstrates how to load a dataset, build regression models using glm and mlpack bindings, and perform predictions using these models.

Goal

The goal of this how-to guide is to provide an introduction to Rel’s machine learning functionality. Specifically, this how-to guide will focus on linear regression and demonstrate how to achieve that using the glm and mlpack bindings.

Preliminaries

We recommend that you first go through the CSV Import and JSON Import and Export guides, since they contain examples and functionality useful for understanding how to load different kinds of data into the system. You may also find the Machine Learning (Classification) guide useful.

Dataset

For this how-to guide we will be using the Airfoil Self-Noise Data Set. This is a dataset from NASA with samples from acoustic and aerodynamic tests of two- and three-dimensional airfoil blade sections. The tests were conducted in a specialized anechoic wind tunnel at varying wind tunnel speeds and angles of attack.

The dataset contains 1503 instances of measurements. Specifically, the columns (i.e., features) in the file, which are all numeric, are the following:

  1. The frequency, in Hertz.
  2. The angle of attack, in degrees.
  3. The chord length, in meters.
  4. The free-stream velocity, in meters/sec.
  5. The suction side displacement thickness, in meters.
  6. The sound pressure level, in decibels.

Here is a sample of the first 5 lines of the airfoil dataset that we will be working with:

800	0	0.3048	71.3	0.00266337	126.201
1000	0	0.3048	71.3	0.00266337	125.201
1250	0	0.3048	71.3	0.00266337	125.951
1600	0	0.3048	71.3	0.00266337	127.591
2000	0	0.3048	71.3	0.00266337	127.461
...

For the purpose of this guide, we will use the first 5 columns as input features and attempt to predict the last column, i.e., the scaled sound pressure level. Our goal will be to build a linear regression model that predicts this sound level.

Loading the Data

We start building a linear regression model by loading the data. We can do this using lined_csv as follows:

def config[:path] = "s3://relationalai-documentation-public/ml-regression/airfoil/airfoil_self_noise.dat"

def config[:syntax, :header_row] = -1
def config[:syntax, :header] =
    (1, :frequency);
    (2, :angle);
    (3, :chord_length);
    (4, :velocity);
    (5, :displacement);
    (6, :sound)

def config[:syntax, :delim] = '\t'

def config[:schema, :frequency] = "float"
def config[:schema, :angle] = "float"
def config[:schema, :chord_length] = "float"
def config[:schema, :velocity] = "float"
def config[:schema, :displacement] = "float"
def config[:schema, :sound] = "float"

// insert transaction
def insert[:airfoil] = lined_csv[load_csv[config]]

The code above specifies the file location using an AWS S3 URL.

Since the airfoil file from the UCI Machine Learning Repository has no header row, we specified (through :header_row = -1) that there is no header in the file. We defined our own header (using :header) and also specified the schema of the imported file, indicating that all features are of type float. The row IDs loaded into the airfoil relation will be useful later on, when we split our dataset into training and test sets.
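As a quick sanity check (a minimal sketch that uses only the relation loaded above), we can confirm that all 1503 instances made it into the database by counting one of the imported columns:

// Should return 1503, the number of instances in the airfoil dataset
def output = count[airfoil:sound]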

Preparing the Data

Once we have the data loaded, we need to transform it so that we can feed it into the machine learning models.

In general, we support a variety of machine learning models. The complete list of supported models can be found in The Machine Learning Library.

Most of these models require two relations:

  • a relation containing the features to be used as inputs to train a model, and
  • a relation containing the response (or target) variable that we want to learn to predict.

To this end, we put the feature data in the feature relation and the response data in the response relation.

def feature = airfoil[col]
    for col in {:frequency; :angle; :chord_length; :velocity; :displacement}

def response = airfoil:sound

We can easily get statistics about our feature data using describe:

table[describe[feature]]

Relation: output

              angle               chord_length         displacement          frequency           velocity
count         1503                1503                 1503                  1503                1503
max           22.2                0.3048               0.0584113             20000.0             71.3
mean          6.782302062541517   0.13654823685961226  0.011139880391217556  2886.3805721889553  50.860745176314175
min           0.0                 0.0254               0.000400682           200.0               31.7
percentile25  2.0                 0.0508               0.00253511            800.0               39.6
percentile50  5.4                 0.1016               0.00495741            1600.0              39.6
percentile75  9.9                 0.2286               0.0155759             4000.0              71.3
std           5.918128124886475   0.09354072837396629  0.013150234266814775  3152.5731369306686  15.57278439538569

We can also do the same for our response data:

table[(:response, describe_full[response])]

Relation: output

              response
count         1503
max           140.987
mean          124.83594278110434
min           103.38
percentile25  120.191
percentile50  125.721
percentile75  129.9955
std           6.898656621628727

Creating Train and Test Datasets

In our approach, we will use a “train” dataset to learn a linear regression model and a “test” dataset to determine the accuracy of our model. In certain cases, we may also use a validation dataset for parameter tuning, but we will consider only train and test for the purposes of this guide.

Since the airfoil dataset is not already split into test and train sets, we will have to create these two datasets.

In the following, we split our data into training and test sets with a ratio of 80/20. We specify the splitting ratio and the seed in split_param. The splitting is done by mlpack_preprocess_split, which splits the keys in the two sets. Afterwards, we join them with the feature and response so that we generate the corresponding training and test data sets:

def split_param = {("test_ratio", "0.2"); ("seed", "42")}

def data_key(:keys, k) = feature(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_param]

def feature_train(f, k, v) = feature(f, k, v) and data_key_split(1, k)
def feature_test(f, k, v) = feature(f, k, v) and data_key_split(2, k)

def response_train(k, v) = response(k, v) and data_key_split(1, k)
def response_test(k, v) = response(k, v) and data_key_split(2, k)

The relation split_param specifies the exact splitting ratio between training and test sets. Note that both the parameter names and the values need to be encoded as strings.
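For example (purely illustrative, reusing the same parameter names shown above), a 70/30 split with a different random seed would be specified as:

// Hypothetical alternative split: hold out 30% of the data, seeded with 7
def split_param_alt = {("test_ratio", "0.3"); ("seed", "7")}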

At this point, we can also add various checks to ensure that we have included all the instances from the original data set when we did the splitting in training and test. For example, we can check that the number of instances in training and test add up:

ic all_data() {
    count[feature_train] + count[feature_test] = count[feature]
}

Or, we can more rigorously ensure that we have actually performed a split using all the available data:

ic all_feature() {
    equal(feature, union[feature_train, feature_test])
}
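We can also inspect the sizes of the two splits directly (a small sketch using the relations defined above); with a test ratio of 0.2 we expect roughly 1200 training and 300 test instances:

// Compare the actual split sizes with the expected 80/20 ratio
def output:train_size = count[response_train]
def output:test_size = count[response_test]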

Building a Linear Regression Model

We will mostly be using glm bindings to create a linear regression model. To this end, we will use glm_linear_regression from The Machine Learning Library (ml), which takes as input two relations: the features that we want to learn from, and the response. We can train a glm linear regression model as follows:

def glm_lr = glm_linear_regression[feature_train, response_train]

Note that glm provides only unregularized linear regression. We will also discuss a regularized model in this guide, using mlpack’s linear regression functionality, which supports passing a lambda parameter (for ridge regression) as follows:

def hyper_param = {("lambda", "0.1")}

def mlpack_lr = mlpack_linear_regression[
    feature_train,
    response_train,
    hyper_param
]

Performing Predictions

Once we have our model ready, we can use it to perform predictions. We will use glm_predict for our glm model, and mlpack_linear_regression_predict for our mlpack model. In both cases, we will have to provide:

  1. the trained ML model,
  2. a relation with features similar to the one that was used for training, and
  3. the number of keys used in the feature relation.

The information about the number of keys is necessary so that glm_predict and mlpack_linear_regression_predict know how many key columns are present in feature_train and feature_test. In our case, we have only one key, i.e., the CSV row number, which we carried over from the data loading step. We specify this fact by passing 1 as the last parameter in the glm_predict call below (and similarly for the mlpack call).

We can now predict the sound (i.e. the response variable) using the training dataset:

def glm_prediction_train = glm_predict[glm_lr, feature_train, 1]

def mlpack_prediction_train = mlpack_linear_regression_predict[
    mlpack_lr,
    feature_train,
    1
]

We can also predict the sound for the unseen test dataset:

def glm_prediction_test = glm_predict[glm_lr, feature_test, 1]

def mlpack_prediction_test = mlpack_linear_regression_predict[
    mlpack_lr,
    feature_test,
    1
]

Let’s look at some predictions for the test dataset:

top[5, glm_prediction_test]

Relation: output

1  1   127.54219
2  18  125.71245
3  20  124.95066
4  23  122.98273
5  27  116.82499

top[5, mlpack_prediction_test]

Relation: output

1  1   127.67112006768475
2  18  125.76921854693357
3  20  125.00906643613914
4  23  123.04534014992018
5  27  116.90077725433181
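To get a feel for how close these predictions are to the actual measurements, we can join the predictions with response_test on the row key (a sketch using the relations defined above; prediction_vs_actual is just an illustrative name):

// For each test row: predicted vs. actual sound pressure level
def prediction_vs_actual(k, predicted, actual) =
    glm_prediction_test(k, predicted) and response_test(k, actual)

top[5, prediction_vs_actual]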

Evaluating Our Model

In order to evaluate a linear model, one of the metrics that we can use is the $R^2$ value (i.e., the square of the correlation coefficient $R$). The $R^2$ value typically lies between 0 and 1 and attempts to measure the overall fit of a linear model by measuring the proportion of the variance in the observed data that is explained by the model. Higher $R^2$ values are better because they indicate that more variance is explained by the model.

The $R^2$ is defined as follows:

$$\textrm{R}^2 = 1 - \frac{\sum_i (y_i - \hat{y_i})^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the expected value, $\hat{y_i}$ is the predicted value, and $\bar{y}$ is the mean of the expected values.

In addition to $R^2$, we can also use the Mean Square Error (MSE) or the Root Mean Square Error (RMSE), which attempt to capture the actual deviation of the predictions from the expected values:

$$\textrm{MSE} = \frac{1}{N} \sum_{i}^{N} (y_i - \hat{y_i})^2$$

$$\textrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i}^{N} (y_i - \hat{y_i})^2}$$

And we can also use the Mean Absolute Error (MAE), which is a more direct representation of the deviation of the predictions from the expected values:

$$\textrm{MAE} = \frac{1}{N} \sum_{i}^{N} |y_i - \hat{y_i}|$$

We now compute each of these metrics on the test dataset, for both of our models. We first provide the definitions for R2 and MAE, since we will use the Standard Library versions of mse and rmse:

// R2 (1 minus the ratio of residual to total sum of squares, as defined above)
@inline def R2[R, P] =
    1 - (sum[(R[pos] - P[pos])^2 for pos] /
         sum[(R[pos] - mean[R])^2 for pos])

// MAE
@inline def MAE[R, P] = sum[abs[R[pos] - P[pos]] for pos] / count[R]

And next, we compute them for the different models:

module result
    def glm:R2 = R2[response_test, glm_prediction_test]
    def glm:mse = mse[response_test, glm_prediction_test]
    def glm:rmse = rmse[response_test, glm_prediction_test]
    def glm:MAE = MAE[response_test, glm_prediction_test]

    def mlpack:R2 = R2[response_test, mlpack_prediction_test]
    def mlpack:mse = mse[response_test, mlpack_prediction_test]
    def mlpack:rmse = rmse[response_test, mlpack_prediction_test]
    def mlpack:MAE = MAE[response_test, mlpack_prediction_test]
end

def output = table[result]

Relation: output

       glm                  mlpack
MAE    3.580322933333335    3.681386250348452
R2     0.5653623681560381   0.5494131467831013
mse    21.437125606308673   22.223770472847093
rmse   4.630024363468153    4.7142094218274915

Summary

We demonstrated the use of a linear regression model on the airfoil dataset. More specifically, we used glm_linear_regression and mlpack_linear_regression. In addition to mlpack and glm, we also support xgboost, with more bindings on the way.

It is important to note here that all of our machine learning models are designed to expose the same API. In this way, we can easily swap machine learning models (of a similar type, i.e., linear regression models). In our example, we used both glm_linear_regression and mlpack_linear_regression and, in this way, we easily trained two different linear models (one unregularized and one regularized), as sketched below.
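For instance, switching between the two models used in this guide only changes the model-building and prediction definitions; the feature, response, and evaluation code stays the same (this simply restates the calls shown earlier):

// Same workflow, different bindings: only these two definitions change
def model = glm_linear_regression[feature_train, response_train]
def prediction = glm_predict[model, feature_test, 1]

// ... versus ...
// def model = mlpack_linear_regression[feature_train, response_train, hyper_param]
// def prediction = mlpack_linear_regression_predict[model, feature_test, 1]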

In addition to the machine learning models, The Machine Learning Library (ml) provides useful functionality for other tasks as well. For example, we can perform k-nearest-neighbor search on a relation through mlpack_knn, or perform dimensionality reduction on a given dataset through kernel principal component analysis (KPCA) with mlpack_kernel_pca.

For a complete list of machine learning models and related functionality see The Machine Learning Library.