# Machine Learning: Regression

This how-to guide demonstrates how to load a dataset, build regression models using glm and mlpack bindings, and perform predictions using these models.

## Goal

This how-to guide provides an introduction to Rel’s machine learning functionality.
Specifically, this guide focuses on linear regression and demonstrates how to achieve it using the `glm` and `mlpack` bindings.

## Preliminaries

It is helpful to read through the CSV Import and JSON Import and Export guides. These guides contain examples that will show you how to load different kinds of data into the system. You may also find the Machine Learning (Classification) guide useful.

## Dataset

This how-to guide uses the Airfoil Self-Noise Data Set. This is a dataset from NASA with samples from acoustic and aerodynamic tests of two- and three-dimensional airfoil blade sections. The tests were conducted in a specialized anechoic wind tunnel with varying wind tunnel speeds and angles.

The dataset contains 1503 instances of measurements. Specifically, the columns (i.e., features) in the file, which are all numeric, are the following:

- The frequency, in Hertz.
- The angle of attack, in degrees.
- The chord length, in meters.
- The free-stream velocity, in meters/sec.
- The suction side displacement thickness, in meters.
- The sound pressure level, in decibels.

Here is a sample of the first five lines of the `airfoil` dataset:

```
800 0 0.3048 71.3 0.00266337 126.201
1000 0 0.3048 71.3 0.00266337 125.201
1250 0 0.3048 71.3 0.00266337 125.951
1600 0 0.3048 71.3 0.00266337 127.591
2000 0 0.3048 71.3 0.00266337 127.461
...
```

You will use the first five as input features, and you will attempt to predict the last column, i.e., the scaled sound pressure level. Your goal is to build a linear regression model to predict this sound level.
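To get a feel for the data outside of Rel, the sample rows above can be parsed with a few lines of plain Python. This is purely illustrative; the column names below mirror the header defined later in this guide:

```python
# Parse the five sample rows shown above (whitespace-delimited, no header).
sample = """\
800 0 0.3048 71.3 0.00266337 126.201
1000 0 0.3048 71.3 0.00266337 125.201
1250 0 0.3048 71.3 0.00266337 125.951
1600 0 0.3048 71.3 0.00266337 127.591
2000 0 0.3048 71.3 0.00266337 127.461
"""

columns = ["frequency", "angle", "chord_length",
           "velocity", "displacement", "sound"]

rows = [
    dict(zip(columns, map(float, line.split())))
    for line in sample.splitlines()
]

# The first five columns are the input features; `sound` is the target.
features = [{k: r[k] for k in columns[:-1]} for r in rows]
response = [r["sound"] for r in rows]
```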

## Loading the Data

You will begin building a linear regression model by loading the data.
You can do this using `lined_csv` as follows:

```
// update
def config[:path] = "s3://relationalai-documentation-public/ml-regression/airfoil/airfoil_self_noise.dat"
def config[:syntax, :header_row] = -1
def config[:syntax, :header] =
    (1, :frequency);
    (2, :angle);
    (3, :chord_length);
    (4, :velocity);
    (5, :displacement);
    (6, :sound)
def config[:syntax, :delim] = '\t'
def config[:schema, :frequency] = "float"
def config[:schema, :angle] = "float"
def config[:schema, :chord_length] = "float"
def config[:schema, :velocity] = "float"
def config[:schema, :displacement] = "float"
def config[:schema, :sound] = "float"
// insert transaction
def insert[:airfoil] = lined_csv[load_csv[config]]
```

The code above specifies the file location using an AWS `S3` URL.

Since the `airfoil` file from the UCI Machine Learning Repository has no header, the code specifies (through `:header_row = -1`) that the file contains no header row.
You defined the header (using `:header`) and specified the schema of the imported file, indicating that all features are of type `float`.
The row IDs loaded in the `airfoil` relation will be useful later on when you need to split the dataset into training and test sets.

## Preparing the Data

Once you have the data loaded, you need to transform it so that it can be fed into the machine learning models.

In general, you can use a variety of machine learning models. For the complete list of supported models, see The Machine Learning Library.

Most of these models require two relations:

- A relation containing the features to be used as inputs to train a model.
- A relation containing the response (or target) variable that you want to learn to predict.

To this end, you can put the feature data in the `feature` relation and the response data in the `response` relation.

```
// install
def feature = airfoil[col]
    for col in {:frequency; :angle; :chord_length; :velocity; :displacement}
def response = airfoil:sound
```

You can easily get statistics about the `feature` data using `describe`:

```
// query
table[describe[feature]]
```

You can also do the same for your `response` data:

```
// query
table[(:response, describe_full[response])]
```
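As a point of comparison, the kind of summary that `describe` reports can be sketched in plain Python for a single column. The snippet below uses the five sample `sound` values from earlier; the exact set of statistics shown is an assumption for illustration:

```python
import math

# Summary statistics for one column, analogous to what `describe` reports:
# count, mean, standard deviation, min, and max.
sound = [126.201, 125.201, 125.951, 127.591, 127.461]

count = len(sound)
mean = sum(sound) / count
variance = sum((v - mean) ** 2 for v in sound) / (count - 1)  # sample variance

summary = {
    "count": count,
    "mean": mean,
    "std": math.sqrt(variance),
    "min": min(sound),
    "max": max(sound),
}
```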

### Creating Train and Test Datasets

In this guide, you will use a *train* dataset to learn a linear regression model and a *test* dataset to determine the accuracy of your model.
In certain cases, you may also use a *validation* dataset for parameter tuning, but only train and test are considered for the purposes of this guide.

Since the `airfoil` dataset is not already split into test and train sets, you will have to create these two datasets.

The following example splits the data into training and test sets with a ratio of 80/20.
You can specify the splitting ratio and the seed in `split_param`.
The splitting is done by `mlpack_preprocess_split`, which splits the keys into the two sets.
Afterwards, you can join them with the `feature` and `response` relations to generate the corresponding training and test datasets:

```
// install
def split_param = {("test_ratio", "0.2"); ("seed", "42")}
def data_key(:keys, k) = feature(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_param]
def feature_train(f, k, v) = feature(f, k, v) and data_key_split(1, k)
def feature_test(f, k, v) = feature(f, k, v) and data_key_split(2, k)
def response_train(k, v) = response(k, v) and data_key_split(1, k)
def response_test(k, v) = response(k, v) and data_key_split(2, k)
```

The relation `split_param` specifies the exact splitting ratio between training and test sets.
Note that both the parameter name and the value need to be encoded as strings.
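A seeded key split of this kind can be sketched in plain Python. This is a minimal stand-in, not the mlpack implementation, and the row keys below are hypothetical stand-ins for the 1503 CSV row IDs:

```python
import random

def split_keys(keys, test_ratio=0.2, seed=42):
    """Shuffle the keys with a fixed seed, then carve off the test set."""
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return set(shuffled[n_test:]), set(shuffled[:n_test])  # (train, test)

keys = range(1, 1504)  # hypothetical row IDs, 1 through 1503
train, test = split_keys(keys)
```

The integrity checks in this guide correspond to asserting that the train and test sets are disjoint and that together they cover every key.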

At this point, you can also add various checks to ensure that you have included all the instances from the original dataset when you did the splitting in training and test. For example, you can check that the number of instances in training and test adds up:

```
// install
ic all_data() {
    count[feature_train] + count[feature_test] = count[feature]
}
```

Or, you can more rigorously ensure that you have actually performed a split using all the available data:

```
// install
ic all_feature() {
    equal(feature, union[feature_train, feature_test])
}
```

## Building a Linear Regression Model

This guide uses *glm* bindings to create a Linear Regression Model.
To this end, you will use `glm_linear_regression` from The Machine Learning Library (ml), which takes as input two relations: the features that you want to learn from and the response.
You can train a glm linear regression model as follows:

```
// install
def glm_lr = glm_linear_regression[feature_train, response_train]
```

Note that *glm* only provides unregularized linear regression.
This guide also discusses a regularized model using *mlpack*’s linear regression functionality, which supports passing a lambda parameter for ridge regression:

```
// install
def hyper_param = {("lambda", "0.1")}
def mlpack_lr = mlpack_linear_regression[
    feature_train,
    response_train,
    hyper_param
]
```
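The difference between the two estimators can be illustrated with a self-contained, pure-Python sketch (this is not the glm or mlpack implementation). For a single feature and no intercept, the closed forms reduce to scalars, which makes the effect of lambda easy to see:

```python
# OLS:   w = sum(x*y) / sum(x*x)
# Ridge: w = sum(x*y) / (sum(x*x) + lambda)   -- the penalty shrinks w toward 0
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # exactly y = 2x, no noise

sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)

w_ols = sxy / sxx            # recovers the true slope, 2.0
lam = 0.1
w_ridge = sxy / (sxx + lam)  # slightly shrunk toward zero
```

With clean data the unregularized fit is exact, while ridge trades a small bias for lower variance, which tends to pay off on noisy data.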

## Performing Predictions

Once the model is ready, you can use it to perform predictions.
You will use `glm_predict` for your glm model, and `mlpack_linear_regression_predict` for your mlpack model.
In both cases, you will have to provide:

- The trained ML model.
- A relation with features similar to the one that was used for training.
- The number of keys used in the feature relation.

The information about the number of keys is necessary so that `glm_predict` and `mlpack_linear_regression_predict` know how many keys are present in `feature_train` and `feature_test`.
In this case, you have only one key, i.e., the CSV row number, carried over from the data loading step.
You can specify this fact by passing `1` as the last parameter in the `glm_predict` call below.

You can now predict the sound (i.e., the response variable) using the training dataset:

```
// install
def glm_prediction_train = glm_predict[glm_lr, feature_train, 1]
def mlpack_prediction_train = mlpack_linear_regression_predict[
    mlpack_lr,
    feature_train,
    1
]
```

You can also predict the sound for the unseen test dataset:

```
// install
def glm_prediction_test = glm_predict[glm_lr, feature_test, 1]
def mlpack_prediction_test = mlpack_linear_regression_predict[
    mlpack_lr,
    feature_test,
    1
]
```

Here are some predictions for the test dataset:

```
// query
top[5, glm_prediction_test]
```

```
// query
top[5, mlpack_prediction_test]
```

## Evaluating the Model

One of the metrics you can use to evaluate a linear model is the $R^2$ value (or the square of the correlation coefficient $R$). The $R^2$ is a value between zero and one and attempts to measure the overall fit of a linear model by measuring the proportion of variance explained by the model over the observed data. Higher $R^2$ values are better because they indicate that more variance is explained by the model.

The $R^2$ is defined as follows:

$\textrm{R}^2 = 1 - \frac{\sum_i (y_i - \hat{y_i})^2}{\sum_i (y_i - \bar{y})^2}$

$y_i$ is the expected value, $\hat{y_i}$ is the predicted value, and $\bar{y}$ is the mean of the expected values.

In addition to $R^2$, you can also use the *Mean Square Error* (*MSE*) or the *Root Mean Square Error* (*RMSE*), which attempt to capture the actual deviation of the predictions from the expected values:

$\textrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y_i})^2$

$\textrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y_i})^2}$

You can also use the *Mean Absolute Error* (*MAE*), which is a more direct representation of the deviation of the predictions from the expected values:

$\textrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y_i}|$
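The four formulas above can be checked with a plain-Python sketch on hypothetical expected values `y` and predictions `yhat`:

```python
import math

y    = [1.0, 2.0, 3.0, 4.0]
yhat = [1.1, 1.9, 3.2, 3.9]

n = len(y)
mean_y = sum(y) / n

ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))   # residual sum of squares
ss_tot = sum((a - mean_y) ** 2 for a in y)            # total sum of squares

r2   = 1 - ss_res / ss_tot
mse  = ss_res / n
rmse = math.sqrt(mse)
mae  = sum(abs(a - b) for a, b in zip(y, yhat)) / n
```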

You can now compute each of these metrics for the test dataset and for your two models.
You first need to provide the definitions for `R2` and `MAE`, since you will use the Standard Library versions of [`mse`](/rel/ref/lib/stdlib#mse) and `rmse`:

```
// install
// R2
@inline def R2[R, P] =
    1 - sum[(R[pos] - P[pos])^2 for pos] /
        sum[(R[pos] - mean[R])^2 for pos]

// MAE
@inline def MAE[R, P] = sum[abs[R[pos] - P[pos]] for pos] / count[R]
```

Next, you can compute them for the different models:

```
// query
module result
    def glm:R2 = R2[response_test, glm_prediction_test]
    def glm:mse = mse[response_test, glm_prediction_test]
    def glm:rmse = rmse[response_test, glm_prediction_test]
    def glm:MAE = MAE[response_test, glm_prediction_test]

    def mlpack:R2 = R2[response_test, mlpack_prediction_test]
    def mlpack:mse = mse[response_test, mlpack_prediction_test]
    def mlpack:rmse = rmse[response_test, mlpack_prediction_test]
    def mlpack:MAE = MAE[response_test, mlpack_prediction_test]
end
def output = table[result]
```

## Summary

This guide has demonstrated the use of a linear regression model on the `airfoil` dataset.
More specifically, this guide used `glm_linear_regression` and `mlpack_linear_regression`.
In addition to mlpack and glm, other machine learning libraries, such as xgboost, are also supported, with more on the way.

It is important to note here that all supported machine learning models are specifically designed to have the same API.
In this way, you can easily swap between machine learning models of the same type, e.g., linear regression models.
In this example, you used both `glm_linear_regression` and `mlpack_linear_regression`, and in doing so easily trained two different linear models, one unregularized and one regularized.

## See Also

In addition to the machine learning models, The Machine Learning Library (ml) has useful functionality for other tasks.
For example, you can perform k-nearest-neighbor search on a relation through `mlpack_knn`, or perform dimensionality reduction through kernel principal component analysis (KPCA) on a given dataset through `mlpack_kernel_pca`.

For a complete list of machine learning models and related functionality, see The Machine Learning Library.