# Machine Learning: Regression

This How-To Guide demonstrates how to load a dataset, build regression models using glm and mlpack bindings, and perform predictions using these models.

## Goal

The goal of this how-to guide is to provide an introduction to Rel’s machine learning functionality. Specifically, this how-to guide will focus on linear regression and demonstrate how to achieve that using the `glm` and `mlpack` bindings.

## Preliminaries

We recommend that you first go through the CSV Import and JSON Import and Export guides, since they contain examples and functionality useful for understanding how to appropriately load different kinds of data into the system. You may also find the Machine Learning (Classification) guide useful.

## Dataset

For this how-to guide we will be using the Airfoil Self-Noise Data Set. This is a dataset from NASA with samples from acoustic and aerodynamic tests of two- and three-dimensional airfoil blade sections. The tests were conducted in a specialized anechoic wind tunnel with varying wind tunnel speeds and angles of attack.

The dataset contains 1503 instances of measurements. Specifically, the columns (i.e., features) in the file, which are all numeric, are the following:

- The frequency, in Hertz.
- The angle of attack, in degrees.
- The chord length, in meters.
- The free-stream velocity, in meters/sec.
- The suction side displacement thickness, in meters.
- The sound pressure level, in decibels.

Here is a sample of the first 5 lines of the `airfoil` dataset that we will be working with:

```
800 0 0.3048 71.3 0.00266337 126.201
1000 0 0.3048 71.3 0.00266337 125.201
1250 0 0.3048 71.3 0.00266337 125.951
1600 0 0.3048 71.3 0.00266337 127.591
2000 0 0.3048 71.3 0.00266337 127.461
...
```

For the purpose of this guide, we will use the first 5 columns as input features and attempt to predict the last column, i.e., the scaled sound pressure level. Our goal will be to build a linear regression model that predicts this sound level.
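As a quick sanity check of the layout, the whitespace-separated rows shown above can be split into features and response in a few lines of Python (illustrative only; the actual loading is done in Rel in the next section):

```python
# Parse the sample rows shown above into features (first 5 columns)
# and the response (last column, the sound pressure level).
sample = """\
800 0 0.3048 71.3 0.00266337 126.201
1000 0 0.3048 71.3 0.00266337 125.201
"""

rows = [[float(tok) for tok in line.split()] for line in sample.strip().splitlines()]
features = [r[:5] for r in rows]
response = [r[5] for r in rows]
print(response)  # [126.201, 125.201]
```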

## Loading the Data

We start building a linear regression model by loading the data. We can do this using `lined_csv` as follows:

```
def config[:path] = "s3://relationalai-documentation-public/ml-regression/airfoil/airfoil_self_noise.dat"

def config[:syntax, :header_row] = -1
def config[:syntax, :header] =
    (1, :frequency);
    (2, :angle);
    (3, :chord_length);
    (4, :velocity);
    (5, :displacement);
    (6, :sound)
def config[:syntax, :delim] = '\t'

def config[:schema, :frequency] = "float"
def config[:schema, :angle] = "float"
def config[:schema, :chord_length] = "float"
def config[:schema, :velocity] = "float"
def config[:schema, :displacement] = "float"
def config[:schema, :sound] = "float"

// insert transaction
def insert[:airfoil] = lined_csv[load_csv[config]]
```

The code above specifies the file location using an AWS `S3` URL.

Since the `airfoil` file from the UCI Machine Learning Repository has no header, we specified (through `:header_row = -1`) that there is no header in the file. We defined our own header (using `:header`) and we also specified the schema of the imported file, indicating that all features are of type `float`. The row ids loaded in the `airfoil` relation will be useful later on when we need to split our dataset into training and test sets.

## Preparing the Data

Once we have the data loaded, we need to transform the data in order to feed them into the machine learning models.

In general, we support a variety of machine learning models. The complete list of supported models can be found in The Machine Learning Library.

Most of these models require two relations:

- a relation containing the features to be used as inputs to train a model, and
- a relation containing the response (or target) variable that we want to learn to predict.

To this end, we put the feature data in the `feature` relation and the response data in the `response` relation:

```
def feature = airfoil[col]
    for col in {:frequency; :angle; :chord_length; :velocity; :displacement}

def response = airfoil:sound
```

We can easily get statistics about our `feature` data using `describe`:

```
table[describe[feature]]
```

Relation: output

|  | angle | chord_length | displacement | frequency | velocity |
| --- | --- | --- | --- | --- | --- |
| count | 1503 | 1503 | 1503 | 1503 | 1503 |
| max | 22.2 | 0.3048 | 0.0584113 | 20000.0 | 71.3 |
| mean | 6.782302062541517 | 0.13654823685961226 | 0.011139880391217556 | 2886.3805721889553 | 50.860745176314175 |
| min | 0.0 | 0.0254 | 0.000400682 | 200.0 | 31.7 |
| percentile25 | 2.0 | 0.0508 | 0.00253511 | 800.0 | 39.6 |
| percentile50 | 5.4 | 0.1016 | 0.00495741 | 1600.0 | 39.6 |
| percentile75 | 9.9 | 0.2286 | 0.0155759 | 4000.0 | 71.3 |
| std | 5.918128124886475 | 0.09354072837396629 | 0.013150234266814775 | 3152.5731369306686 | 15.57278439538569 |

We can also do the same for our `response` data:

```
table[(:response, describe_full[response])]
```

Relation: output

|  | response |
| --- | --- |
| count | 1503 |
| max | 140.987 |
| mean | 124.83594278110434 |
| min | 103.38 |
| percentile25 | 120.191 |
| percentile50 | 125.721 |
| percentile75 | 129.9955 |
| std | 6.898656621628727 |

### Creating Train and Test Datasets

In our approach, we will use a “train” dataset to learn a linear regression model and a “test” dataset to determine the accuracy of our model. In certain cases, we may also use a validation dataset for parameter tuning, but we will consider only train and test for the purposes of this guide.

Since the `airfoil` dataset is not already split into test and train sets, we will have to create these two datasets.

In the following, we split our data into training and test sets with a ratio of 80/20. We specify the splitting ratio and the seed in `split_param`. The splitting is done by `mlpack_preprocess_split`, which splits the keys into the two sets. Afterwards, we join them with the `feature` and `response` relations to generate the corresponding training and test datasets:

```
def split_param = {("test_ratio", "0.2"); ("seed", "42")}

def data_key(:keys, k) = feature(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_param]

def feature_train(f, k, v) = feature(f, k, v) and data_key_split(1, k)
def feature_test(f, k, v) = feature(f, k, v) and data_key_split(2, k)
def response_train(k, v) = response(k, v) and data_key_split(1, k)
def response_test(k, v) = response(k, v) and data_key_split(2, k)
```

The relation `split_param` specifies the exact splitting ratio between training and test sets. Note that the parameter name as well as the value need to be encoded as strings.
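Outside of Rel, the same kind of deterministic, seeded key split can be sketched in a few lines of Python (a minimal illustration, not the `mlpack_preprocess_split` implementation; the `test_ratio` and `seed` names simply mirror the parameters above):

```python
import random

def split_keys(keys, test_ratio=0.2, seed=42):
    """Shuffle keys deterministically, then split them into train/test sets."""
    rng = random.Random(seed)           # fixed seed -> reproducible split
    shuffled = list(keys)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    # First n_test shuffled keys become the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]

# 1503 row ids, as in the airfoil dataset
train, test = split_keys(range(1, 1504))
print(len(train), len(test))  # 1202 301
```

Because the seed is fixed, repeated calls produce the identical partition, and the two sets together always cover every key exactly once.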

At this point, we can also add various checks to ensure that we have included all the instances from the original data set when we did the splitting in training and test. For example, we can check that the number of instances in training and test add up:

```
ic all_data() {
    count[feature_train] + count[feature_test] = count[feature]
}
```

Or, we can more rigorously ensure that we have actually performed a split using all the available data:

```
ic all_feature() {
    equal(feature, union[feature_train, feature_test])
}
```

## Building a Linear Regression Model

We will be mostly using *glm* bindings to create a Linear Regression Model.
To this end, we will use `glm_linear_regression` from The Machine Learning Library (ml), which takes as input two relations: the features that we want to learn from, and the response. We can train a glm linear regression model as follows:

```
def glm_lr = glm_linear_regression[feature_train, response_train]
```

Note that *glm* provides only unregularized linear regression.
We will also discuss a regularized model in this guide using *mlpack*’s linear regression functionality, which supports the passing of a lambda (for ridge regression) as follows:

```
def hyper_param = {("lambda", "0.1")}

def mlpack_lr = mlpack_linear_regression[
    feature_train,
    response_train,
    hyper_param
]
```
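Ridge regression adds an L2 penalty λ‖w‖² to the least-squares objective, which shrinks the learned weights. A closed-form sketch in plain NumPy (illustrative only — this is not how mlpack computes it, and `ridge_fit`/`ridge_predict` are hypothetical helper names) looks like this:

```python
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Solve (X^T X + lam * I) w = X^T y, with an appended intercept column.

    The intercept is conventionally left unpenalized.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    penalty = lam * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0                          # do not penalize the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def ridge_predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

# Tiny smoke test on data generated from y = 2x + 1
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
w = ridge_fit(X, y, lam=0.1)
print(ridge_predict(w, X)[:3])
```

With a small lambda such as the `0.1` used above, the fitted weights stay close to the ordinary least-squares solution; larger values shrink them more aggressively.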

## Performing Predictions

Once we have our model ready, we can use it to perform predictions. We will use `glm_predict` for our glm model, and `mlpack_linear_regression_predict` for our mlpack model. In both cases, we will have to provide:

- the trained ML model,
- a relation with features similar to the one that was used for training, and
- the number of keys used in the feature relation.

The information about the number of keys is necessary so that `glm_predict` and `mlpack_linear_regression_predict` know how many keys are present in `feature_train` and `feature_test`. In our case, we have only one key, i.e., the csv row number, which we carried over from the data loading step. We specify this by passing `1` as the last parameter in the `glm_prediction_train` call below.

We can now predict the sound (i.e. the response variable) using the training dataset:

```
def glm_prediction_train = glm_predict[glm_lr, feature_train, 1]

def mlpack_prediction_train = mlpack_linear_regression_predict[
    mlpack_lr,
    feature_train,
    1
]
```

We can also predict the sound for the unseen test dataset:

```
def glm_prediction_test = glm_predict[glm_lr, feature_test, 1]

def mlpack_prediction_test = mlpack_linear_regression_predict[
    mlpack_lr,
    feature_test,
    1
]
```

Let’s look at some predictions for the test dataset:

```
top[5, glm_prediction_test]
```

Relation: output

|  |  |  |
| --- | --- | --- |
| 1 | 1 | 127.54219 |
| 2 | 18 | 125.71245 |
| 3 | 20 | 124.95066 |
| 4 | 23 | 122.98273 |
| 5 | 27 | 116.82499 |

```
top[5, mlpack_prediction_test]
```

Relation: output

|  |  |  |
| --- | --- | --- |
| 1 | 1 | 127.67112006768475 |
| 2 | 18 | 125.76921854693357 |
| 3 | 20 | 125.00906643613914 |
| 4 | 23 | 123.04534014992018 |
| 5 | 27 | 116.90077725433181 |

## Evaluating Our Model

In order to evaluate a linear model, one of the metrics that we can use is the $R^2$ value (or, the square of the correlation coefficient $R$). The $R^2$ is a value between 0 and 1 and attempts to measure the overall fit of a linear model by measuring the proportion of variance that is explained by the model over the observed data. Higher $R^2$ values are better because they indicate that more variance is explained by the model.

The $R^2$ is defined as follows:

$$\textrm{R}^2 = 1 - \frac{\sum_i (y_i - \hat{y_i})^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the expected value, $\hat{y_i}$ is the predicted value, and $\bar{y}$ is the mean of the expected values.

In addition to $R^2$, we can also use the Mean Square Error (MSE) or the Root Mean Square Error (RMSE), which attempt to capture the actual deviation of the predictions from the expected values:

$$\textrm{MSE} = \frac{1}{N} \sum_{i}^{N} (y_i - \hat{y_i})^2$$

$$\textrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i}^{N} (y_i - \hat{y_i})^2}$$

And we can also use the Mean Absolute Error (MAE), which is a more direct representation of the deviation of the predictions from the expected values:

$$\textrm{MAE} = \frac{1}{N} \sum_{i}^{N} |y_i - \hat{y_i}|$$

We now compute each of these metrics on the test dataset for our two models. We first provide the definitions for `R2` and `MAE`, since we will use the Standard Library versions of `mse` and `rmse`:

```
// R2
@inline def R2[R, P] =
    1 - sum[(R[pos] - P[pos])^2 for pos] /
        sum[(R[pos] - mean[R])^2 for pos]

// MAE
@inline def MAE[R, P] = sum[abs[R[pos] - P[pos]] for pos] / count[R]
```
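The four metrics can be cross-checked with a few lines of NumPy, directly mirroring the formulas above (an illustrative sketch, not the Rel or Standard Library implementations):

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

# Toy check: mean of y is 5, SS_tot = 8, SS_res = 2
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 8.0])
print(r2(y, y_hat))   # 0.75
print(mse(y, y_hat))  # ≈ 0.667
print(rmse(y, y_hat))
print(mae(y, y_hat))  # ≈ 0.667
```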

And next, we compute them for the different models:

```
module result
    def glm:R2 = R2[response_test, glm_prediction_test]
    def glm:mse = mse[response_test, glm_prediction_test]
    def glm:rmse = rmse[response_test, glm_prediction_test]
    def glm:MAE = MAE[response_test, glm_prediction_test]

    def mlpack:R2 = R2[response_test, mlpack_prediction_test]
    def mlpack:mse = mse[response_test, mlpack_prediction_test]
    def mlpack:rmse = rmse[response_test, mlpack_prediction_test]
    def mlpack:MAE = MAE[response_test, mlpack_prediction_test]
end

def output = table[result]
```

Relation: output

|  | glm | mlpack |
| --- | --- | --- |
| MAE | 3.580322933333335 | 3.681386250348452 |
| R2 | 0.43463763184396187 | 0.4505868532168987 |
| mse | 21.437125606308673 | 22.223770472847093 |
| rmse | 4.630024363468153 | 4.7142094218274915 |

## Summary

We demonstrated the use of a linear regression model on the `airfoil` dataset. More specifically, we used `glm_linear_regression` and `mlpack_linear_regression`. In addition to mlpack and glm, we also support xgboost, with more bindings to come.

It is important to note here that all of our machine learning models are specifically designed to have the same API. In this way, we can easily swap machine learning models of a similar type (i.e., linear regression models). In our example, we used both `glm_linear_regression` and `mlpack_linear_regression` and, in this way, easily trained two different linear models (one unregularized and one regularized).

In addition to the machine learning models, The Machine Learning Library (ml) has useful functionality for other tasks as well. For example, we can perform k-nearest-neighbor search on a relation through `mlpack_knn`, or perform dimensionality reduction on a given dataset through kernel principal component analysis (KPCA) with `mlpack_kernel_pca`.

For a complete list of machine learning models and related functionality see The Machine Learning Library.