# Machine Learning: Regression

This how-to guide demonstrates how to load a dataset, build regression models using glm and mlpack bindings, and perform predictions using these models.

## Goal

This how-to guide provides an introduction to Rel’s machine learning functionality. Specifically, it focuses on linear regression and demonstrates how to perform it using the glm and mlpack bindings.

## Preliminaries

It is helpful to read through the CSV Import and JSON Import and Export guides. These guides contain examples that will show you how to load different kinds of data into the system. You may also find the Machine Learning (Classification) guide useful.

## Dataset

This how-to guide uses the Airfoil Self-Noise Data Set. This is a dataset from NASA with samples from acoustic and aerodynamic tests of two- and three-dimensional airfoil blade sections. The tests were conducted in a specialized anechoic wind tunnel with varying wind tunnel speeds and angles.

The dataset contains 1503 instances of measurements. Specifically, the columns (i.e., features) in the file, which are all numeric, are the following:

1. The frequency, in Hertz.
2. The angle of attack, in degrees.
3. The chord length, in meters.
4. The free-stream velocity, in meters/sec.
5. The suction side displacement thickness, in meters.
6. The sound pressure level, in decibels.

Here is a sample of the first five lines of the airfoil dataset:

800	0	0.3048	71.3	0.00266337	126.201
1000	0	0.3048	71.3	0.00266337	125.201
1250	0	0.3048	71.3	0.00266337	125.951
1600	0	0.3048	71.3	0.00266337	127.591
2000	0	0.3048	71.3	0.00266337	127.461
...


You will use the first five columns as input features and attempt to predict the last column, i.e., the scaled sound pressure level. Your goal is to build a linear regression model that predicts this sound level.

You will begin building a linear regression model by loading the data. You can do this using lined_csv as follows:

// update

def config[:path] = "s3://relationalai-documentation-public/ml-regression/airfoil/airfoil_self_noise.dat"

def config[:syntax, :header_row] = -1
def config[:syntax, :header] =
    (1, :frequency);
    (2, :angle);
    (3, :chord_length);
    (4, :velocity);
    (5, :displacement);
    (6, :sound)

def config[:syntax, :delim] = '\t'

def config[:schema, :frequency] = "float"
def config[:schema, :angle] = "float"
def config[:schema, :chord_length] = "float"
def config[:schema, :velocity] = "float"
def config[:schema, :displacement] = "float"
def config[:schema, :sound] = "float"

// insert transaction
def insert[:airfoil] = lined_csv[load_csv[config]]

The code above specifies the file location using an AWS S3 URL.

Since the airfoil file from the UCI Machine Learning Repository has no header, the code specifies (through :header_row = -1) that the file has no header row. It then defines the column names (using :header) and the schema of the imported file, indicating that all features are of type float. The row IDs loaded into the airfoil relation will be useful later on, when you need to split the dataset into training and test sets.
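To make the loading step concrete outside of Rel, the following is a minimal Python sketch of the same idea: parse tab-delimited rows that have no header, attach the column names by position, and key every value by its row ID. The function name `load_rows` is hypothetical, the rows are the first three lines of the sample shown earlier, and numbering row IDs from 1 is an assumption made for illustration.

```python
import csv
import io

# Column names mirror the header defined in the Rel config above.
COLUMNS = ["frequency", "angle", "chord_length", "velocity", "displacement", "sound"]

# First three rows of the sample shown earlier: tab-delimited, no header row.
SAMPLE = (
    "800\t0\t0.3048\t71.3\t0.00266337\t126.201\n"
    "1000\t0\t0.3048\t71.3\t0.00266337\t125.201\n"
    "1250\t0\t0.3048\t71.3\t0.00266337\t125.951\n"
)


def load_rows(text):
    """Return {column: {row_id: value}}; row IDs start at 1 (an assumption)."""
    table = {name: {} for name in COLUMNS}
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for row_id, row in enumerate(reader, start=1):
        # Columns are matched to names purely by position.
        for name, raw in zip(COLUMNS, row):
            table[name][row_id] = float(raw)
    return table


airfoil = load_rows(SAMPLE)
print(airfoil["sound"])
```

Keeping the row ID alongside each value is what later allows rows to be assigned to the training or test set by key.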

## Preparing the Data

Once you have the data loaded, you need to transform it so that it can be fed into the machine learning models.

In general, you can use a variety of machine learning models. For the complete list of supported models, see The Machine Learning Library.

Most of these models require two relations:

• A relation containing the features to be used as inputs to train a model.
• A relation containing the response (or target) variable that you want to learn to predict.

To this end, you can put the feature data in the feature relation and the response data in the response relation.

// install

def feature = airfoil[col]
    for col in {:frequency; :angle; :chord_length; :velocity; :displacement}

def response = airfoil:sound

You can easily get statistics about the feature data using describe:

// query

table[describe[feature]]

You can also do the same for your response data:

// query

table[(:response, describe_full[response])]

### Creating Train and Test Datasets

In this guide, you will use a training dataset to learn a linear regression model and a test dataset to determine the accuracy of your model. In certain cases, you may also use a validation dataset for parameter tuning, but this guide considers only training and test datasets.

Since the airfoil dataset is not already split into training and test sets, you will have to create these two datasets.

The following example splits the data into training and test sets with a ratio of 80/20. You can specify the splitting ratio and the seed in split_param. The splitting is done by mlpack_preprocess_split, which splits the keys into two sets. Afterwards, you can join these key sets with the feature and response relations to generate the corresponding training and test datasets:

// install

def split_param = {("test_ratio", "0.2"); ("seed", "42")}

def data_key(:keys, k) = feature(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_param]

def feature_train(f, k, v) = feature(f, k, v) and data_key_split(1, k)
def feature_test(f, k, v) = feature(f, k, v) and data_key_split(2, k)

def response_train(k, v) = response(k, v) and data_key_split(1, k)
def response_test(k, v) = response(k, v) and data_key_split(2, k)

The relation split_param specifies the exact splitting ratio between training and test sets. Note that the parameter name and the value need to be encoded as strings.
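As an illustration of what a seeded split of keys does, here is a minimal Python sketch. It is not how mlpack_preprocess_split is implemented: the helper `split_keys`, the shuffle-then-cut strategy, and the rounding of the test-set size are all assumptions made for this example.

```python
import random


def split_keys(keys, test_ratio, seed):
    """Shuffle keys with a fixed seed and use the last test_ratio share as the test set."""
    shuffled = sorted(keys)
    random.Random(seed).shuffle(shuffled)  # same seed, same split
    n_test = round(len(shuffled) * test_ratio)
    n_train = len(shuffled) - n_test
    return set(shuffled[:n_train]), set(shuffled[n_train:])


keys = range(1, 1504)  # 1503 row IDs, numbered from 1 for illustration
train_keys, test_keys = split_keys(keys, test_ratio=0.2, seed=42)
print(len(train_keys), len(test_keys))
```

Because only the keys are split, the same two key sets can be reused to filter both the features and the response, which keeps each row's inputs and target in the same set.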

At this point, you can also add various checks to ensure that splitting the data into training and test sets has preserved all the instances from the original dataset. For example, you can check that the numbers of instances in the training and test sets add up to the total:

// install

ic all_data() {
    count[feature_train] + count[feature_test] = count[feature]
}

Or, you can more rigorously ensure that you have actually performed a split using all the available data:

// install

ic all_feature() {
    equal(feature, union[feature_train, feature_test])
}
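The difference between the two checks can be seen in a small Python sketch. The data and the faulty split are made up for illustration: in the faulty case the counts still add up, but the union no longer equals the original feature set.

```python
# Features as (feature_name, key, value) triples; values are made up.
feature = {("angle", 1, 0.0), ("angle", 2, 1.5), ("angle", 3, 4.0), ("angle", 4, 2.0)}


def select(triples, keys):
    """Keep only the triples whose key is in the given key set."""
    return {t for t in triples if t[1] in keys}


# A correct split: every key lands in exactly one set.
feature_train = select(feature, {1, 2, 3})
feature_test = select(feature, {4})
counts_add_up = len(feature_train) + len(feature_test) == len(feature)
covers_all = (feature_train | feature_test) == feature

# A faulty split: key 3 appears in both sets and key 4 is dropped.
bad_train = select(feature, {1, 2, 3})
bad_test = select(feature, {3})
bad_counts_add_up = len(bad_train) + len(bad_test) == len(feature)
bad_covers_all = (bad_train | bad_test) == feature

print(counts_add_up, covers_all, bad_counts_add_up, bad_covers_all)
```

The count-based check passes for both splits, while the union-based check rejects the faulty one.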

## Building a Linear Regression Model

This guide uses the glm bindings to create a linear regression model. To this end, you will use glm_linear_regression from The Machine Learning Library (ml), which takes two relations as input: the features that you want to learn from and the response. You can train a glm linear regression model as follows:

// install

def glm_lr = glm_linear_regression[feature_train, response_train]

Note that glm only provides unregularized linear regression. This guide also discusses a regularized model using mlpack’s linear regression functionality, which supports passing a lambda parameter (for ridge regression) as follows:

// install

def hyper_param = {("lambda", "0.1")}

def mlpack_lr = mlpack_linear_regression[
    feature_train,
    response_train,
    hyper_param
]
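To give a feel for what the lambda parameter does, here is a minimal Python sketch of ridge regression with a single feature. It assumes the common textbook formulation in which the penalty applies to the slope only and the intercept is left unpenalized; the function `fit_ridge_1d` is hypothetical, and mlpack's exact scaling of lambda may differ.

```python
def fit_ridge_1d(xs, ys, lam):
    """Fit y = slope * x + intercept, penalizing slope**2 with weight lam."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + lam)  # lam = 0 gives ordinary least squares
    intercept = y_mean - slope * x_mean
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1

ols = fit_ridge_1d(xs, ys, lam=0.0)
ridge = fit_ridge_1d(xs, ys, lam=5.0)
print(ols, ridge)
```

With lam set to zero the fit recovers the line exactly; a positive lam shrinks the slope toward zero, trading some fit on the training data for a less extreme coefficient.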

## Performing Predictions

Once the model is ready, you can use it to perform predictions. You will use glm_predict for your glm model, and mlpack_linear_regression_predict for your mlpack model. In both cases, you will have to provide:

1. The trained ML model.
2. A relation with features similar to the one that was used for training.
3. The number of keys used in the feature relation.

The information about the number of keys is necessary so that glm_predict and mlpack_linear_regression_predict know how many keys are present in feature_train and feature_test. In this case, you have only one key, i.e., the CSV row number, carried over from the data loading step. You specify this by passing 1 as the last parameter of the glm_predict and mlpack_linear_regression_predict calls below.
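The role of the key can be sketched in Python as follows. The weights, intercept, and feature values are hypothetical and are not produced by glm or mlpack; the point is only that, with a single key per row, all feature values sharing a key are combined into one prediction for that key.

```python
from collections import defaultdict

# Hypothetical model parameters, chosen for illustration only.
weights = {"angle": -0.5, "velocity": 0.25}
intercept = 120.0

# Features as (feature_name, key, value) triples with one key: the row ID.
feature_test = [
    ("angle", 7, 2.0), ("velocity", 7, 40.0),
    ("angle", 9, 4.0), ("velocity", 9, 70.0),
]


def predict(weights, intercept, triples):
    """Return {key: prediction} by summing weight * value over each key's features."""
    totals = defaultdict(float)
    for name, key, value in triples:
        totals[key] += weights[name] * value
    return {key: intercept + total for key, total in totals.items()}


prediction_test = predict(weights, intercept, feature_test)
print(prediction_test)
```

The result has the same shape as the response: one value per key, which is what makes predictions directly comparable with the expected values later on.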

You can now predict the sound (i.e., the response variable) using the training dataset:

// install

def glm_prediction_train = glm_predict[glm_lr, feature_train, 1]

def mlpack_prediction_train = mlpack_linear_regression_predict[
    mlpack_lr,
    feature_train,
    1
]

You can also predict the sound for the unseen test dataset:

// install

def glm_prediction_test = glm_predict[glm_lr, feature_test, 1]

def mlpack_prediction_test = mlpack_linear_regression_predict[
    mlpack_lr,
    feature_test,
    1
]

Here are some predictions for the test dataset:

// query

top[5, glm_prediction_test]

// query

top[5, mlpack_prediction_test]

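## Evaluating the Model

One of the metrics you can use to evaluate a linear model is the $R^2$ value, also known as the coefficient of determination (for a least-squares fit with an intercept, it equals the square of the correlation coefficient $R$ on the training data). $R^2$ attempts to measure the overall fit of a linear model by measuring the proportion of variance in the observed data that is explained by the model. It typically lies between zero and one, although it can be negative on held-out data when the model fits worse than simply predicting the mean. Higher $R^2$ values are better because they indicate that more variance is explained by the model.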

The $R^2$ is defined as follows:

$\textrm{R}^2 = 1 - \frac{\sum_i (y_i - \hat{y_i})^2}{\sum_i (y_i - \bar{y})^2}$

$y_i$ is the expected value, $\hat{y_i}$ is the predicted value, and $\bar{y}$ is the mean of the expected values.

In addition to $R^2$, you can also use the Mean Square Error (MSE) or the Root Mean Square Error (RMSE), which attempt to capture the actual deviation of the predictions from the expected values:

$\textrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y_i})^2$

$\textrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y_i})^2}$

You can also use the Mean Absolute Error (MAE), which is a more direct representation of the deviation of the predictions from the expected values:

$\textrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y_i}|$
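As a quick sanity check of the formulas above, the following Python sketch computes all four metrics on a tiny made-up example. The function names and the data are chosen for illustration and are unrelated to the Rel definitions used in this guide.

```python
import math


def r2(expected, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean = sum(expected) / len(expected)
    ss_res = sum((y - p) ** 2 for y, p in zip(expected, predicted))
    ss_tot = sum((y - mean) ** 2 for y in expected)
    return 1 - ss_res / ss_tot


def mse(expected, predicted):
    """Mean of the squared errors."""
    return sum((y - p) ** 2 for y, p in zip(expected, predicted)) / len(expected)


def rmse(expected, predicted):
    """Square root of the MSE, in the same units as the response."""
    return math.sqrt(mse(expected, predicted))


def mae(expected, predicted):
    """Mean of the absolute errors."""
    return sum(abs(y - p) for y, p in zip(expected, predicted)) / len(expected)


expected = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

metrics = {
    "R2": r2(expected, predicted),
    "MSE": mse(expected, predicted),
    "RMSE": rmse(expected, predicted),
    "MAE": mae(expected, predicted),
}
print(metrics)
```

Here the residual sum of squares is 0.5 and the total sum of squares is 20, so $R^2 = 1 - 0.5/20 = 0.975$, while MSE is 0.125 and MAE is 0.25.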

You can now compute each of these metrics on the test dataset for your two models. First, define R2 and MAE; for MSE and RMSE, you will use the Standard Library versions, [mse](/rel/ref/lib/stdlib#mse) and rmse:

// install

// R2
@inline def R2[R, P] =
    1 - sum[(R[pos] - P[pos])^2 for pos] /
        sum[(R[pos] - mean[R])^2 for pos]

// MAE
@inline def MAE[R, P] = sum[abs[R[pos] - P[pos]] for pos] / count[R]

Next, you can compute them for the different models:

// query

module result
    def glm:R2 = R2[response_test, glm_prediction_test]
    def glm:mse = mse[response_test, glm_prediction_test]
    def glm:rmse = rmse[response_test, glm_prediction_test]
    def glm:MAE = MAE[response_test, glm_prediction_test]

    def mlpack:R2 = R2[response_test, mlpack_prediction_test]
    def mlpack:mse = mse[response_test, mlpack_prediction_test]
    def mlpack:rmse = rmse[response_test, mlpack_prediction_test]
    def mlpack:MAE = MAE[response_test, mlpack_prediction_test]
end

def output = table[result]

## Summary

This guide has demonstrated the use of a linear regression model on the airfoil dataset. More specifically, it used glm_linear_regression and mlpack_linear_regression. In addition to mlpack and glm, other machine learning libraries, such as xgboost, are also supported, and more are coming.

Note that all supported machine learning models are specifically designed to have the same API, so you can easily swap machine learning models of a similar type, e.g., linear regression models. In this example, using both glm_linear_regression and mlpack_linear_regression let you train two different linear models, one unregularized and one regularized, with minimal changes.

In addition to the machine learning models, The Machine Learning Library (ml) has useful functionality for other tasks. For example, you can perform k-nearest-neighbor search on a relation through mlpack_knn, or perform dimensionality reduction on a given dataset with kernel principal component analysis (KPCA) through mlpack_kernel_pca.