{
"cells": [
{
"source": "# Machine Learning (Regression)",
"id": "0",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"source": "The original source for this notebook is here.\n\n## Goal\n\nThe goal of this how-to guide is to provide an introduction to Rel's machine learning functionality. Specifically, this how-to guide will focus on linear regression and demonstrate how to achieve that using the `glm` and `mlpack` bindings.\n\n\n## Preliminaries\n\nWe recommend that you first go through the [CSV Import](https://docs.relational.aicsv-import) and [JSON Import and Export](https://docs.relational.aijson-import-export) guides, since they contain examples and functionality useful to understand how to appropriately load different kinds of data into the system. You may also find the [Machine Learning (Classification)](https://docs.relational.aiml-classification) guide useful as well.\n\n\n## Dataset\n\nFor this how-to guide we will be using the [Airfoil Self-Noise Data Set](http://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise#). This is a dataset from NASA with samples from acoustic and aerodynamic tests of two and three-dimensional airfoil blade sections. The tests were conducted in a specialized anechoic wind tunnel with varying wind tunnel speeds and angles.\n\nThe dataset contains 1503 instances of measurements. Specifically, the columns (i.e., features) in the file, which are all numeric, are the following:\n\n1. The frequency, in Hertz.\n2. The angle of attack, in degrees.\n3. The chord length, in meters.\n4. The free-stream velocity, in meters/sec.\n5. The suction side displacement thickness, in meters. \n6. 
The sound pressure level, in decibels.\n\nHere is a sample of the first 5 lines of the `airfoil` dataset that we will be working with:\n\n```\n800 0\t0.3048\t71.3\t0.00266337\t126.201\n1000\t0\t0.3048\t71.3\t0.00266337\t125.201\n1250\t0\t0.3048\t71.3\t0.00266337\t125.951\n1600\t0\t0.3048\t71.3\t0.00266337\t127.591\n2000\t0\t0.3048\t71.3\t0.00266337\t127.461\n...\n```\n\nFor the purpose of this guide, we will use the first 5 as input features and we will attempt to predict the last column, i.e., the scaled sound pressure level. Our goal will be to build a linear regression model to predict this sound level. \n\n## Loading the Data\n\nWe start building a linear regression model by loading the data. We can do this using [lined_csv](https://docs.relational.ai/rel/ref/lib/stdlib#lined_csv) as follows:",
"id": "1",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"source": "def config[:path] = \"s3://relationalai-documentation-public/ml-regression/airfoil/airfoil_self_noise.dat\"\n\ndef config[:syntax, :header_row] = -1\ndef config[:syntax, :header] =\n (1, :frequency);\n (2, :angle);\n (3, :chord_length);\n (4, :velocity) ;\n (5, :displacement) ;\n (6, :sound)\n\ndef config[:syntax, :delim] = '\\t'\n\ndef config[:schema, :frequency] = \"float\"\ndef config[:schema, :angle] = \"float\"\ndef config[:schema, :chord_length] = \"float\"\ndef config[:schema, :velocity] = \"float\"\ndef config[:schema, :displacement] = \"float\"\ndef config[:schema, :sound] = \"float\"\n\n// insert transaction\ndef insert[:airfoil] = lined_csv[load_csv[config]]",
"id": "2",
"type": "update",
"inputs": []
},
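{
"source": "As a cross-check on the load, here is a rough pure-Python sketch of parsing the same headerless, tab-delimited data with the standard `csv` module. The column names simply mirror the `:header` definition above, the sample rows come from the dataset excerpt, and none of this is part of the Rel workflow itself:\n\n```python\nimport csv\nimport io\n\n# Two sample rows from the airfoil dataset (tab-delimited, no header row).\nraw = (\n    \"800\\t0\\t0.3048\\t71.3\\t0.00266337\\t126.201\\n\"\n    \"1000\\t0\\t0.3048\\t71.3\\t0.00266337\\t125.201\\n\"\n)\n\nheader = [\"frequency\", \"angle\", \"chord_length\", \"velocity\", \"displacement\", \"sound\"]\n\n# Parse each row into a dict of floats, keyed by a 1-based row id\n# (analogous to the row ids that lined_csv carries along).\nrows = {\n    i: {name: float(value) for name, value in zip(header, line)}\n    for i, line in enumerate(csv.reader(io.StringIO(raw), delimiter=\"\\t\"), start=1)\n}\n\nprint(rows[1][\"sound\"])  # 126.201\n```",
"id": "31",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},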
{
"source": "Please note that, in the code above, we have specified the file location using an AWS `S3` URL. This location must be accessible by the server that we are running.\n\nSince the `airfoil` file from the UCI Machine Learning Repository has no header, we specified (through `:header_row = -1`) that there is no header to the file. We defined our own header (using `:header`) and we also specified the schema of the imported file, indicating that all attributes are of type `float`. The row ids loaded in the `airfoil` relation will be useful later on when we need to split our dataset in training and test.\n\n## Preparing the Data\n\nOnce we have the data loaded, we need to transform the data in order to feed them into the machine learning models.\n\nIn general, we support a variety of machine learning models. The complete list of supported models can be found in [The Machine Learning Library](https://docs.relational.ai/rel/ref/lib/ml).\n\nMost of these models require two relations:\n\n- a relation containing the features to be used as inputs to train a model, and\n- a relation containing the response (or target) variable that we want to learn to predict.\n\nTo this end, we put the feature data in the `features` relation and the response data in the `responses` relation.",
"id": "3",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"name": "install_4",
"source": "def features = airfoil[col] \n for col in {:frequency; :angle; :chord_length; :velocity; :displacement}\n\ndef responses = airfoil:sound",
"id": "4",
"type": "install",
"inputs": []
},
{
"source": "We can easily get statistics about our `feature` data using [`describe`](https://docs.relational.ai/rel/ref/lib/stdlib#describe_full):",
"id": "5",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"source": "table[describe[features]]",
"id": "6",
"type": "query",
"inputs": []
},
{
"source": "and, of course, we can do the same for our `responses` data:",
"id": "7",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"source": "describe_full[responses]",
"id": "8",
"type": "query",
"inputs": []
},
{
"source": "Here, we used [`describe_full`](https://docs.relational.ai/rel/ref/lib/stdlib#describe_full) as we have only one column in the `responses` relation, in order to get a more detailed explanation of the data.\n\n\n### Creating Train and Test Datasets\n\nIn our approach, we will use a \"train\" dataset to learn a linear regression model and a \"test\" dataset to determine the accuracy of our model. In certain cases, we may also use a validation dataset for parameter tuning, but we will consider only train and test for the purposes of this guide.\n\nSince the `airfoil` dataset is not already split in test and train, we will have to create these two datasets. \n\nIn the following, we split our data into training and test sets with a ratio 80/20. We specify the splitting ratio and the seed in `split_params`. The splitting is done by [`mlpack_preprocess_split`](https://docs.relational.ai/rel/ref/lib/ml#mlpack_preprocess_split), which splits the keys in the two sets. Afterwards, we join them with the `features` and `responses` so that we generate the corresponding training and test data sets:",
"id": "9",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"name": "install_10",
"source": "def split_params = {(\"test_ratio\", \"0.2\"); (\"seed\", \"42\")}\n\ndef data_key(:keys, k) = features(_, k, _)\ndef data_key_split = mlpack_preprocess_split[data_key, split_params]\n\ndef features_train(f, k, v) = features(f, k, v) and data_key_split(1, k)\ndef features_test(f, k, v) = features(f, k, v) and data_key_split(2, k)\n\ndef responses_train(k, v) = responses(k, v) and data_key_split(1, k)\ndef responses_test(k, v) = responses(k, v) and data_key_split(2, k)",
"id": "10",
"type": "install",
"inputs": []
},
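{
"source": "For intuition, the key-splitting step can also be sketched in plain Python. The following is a minimal illustration of an 80/20 split of row ids with a fixed seed; it mirrors `mlpack_preprocess_split` in spirit only, since the library's actual sampling procedure may differ:\n\n```python\nimport random\n\nkeys = list(range(1, 1504))  # 1-based row ids, one per dataset instance\ntest_ratio = 0.2\n\nrng = random.Random(42)  # fixed seed, for reproducibility\nshuffled = keys[:]\nrng.shuffle(shuffled)\n\n# The first 20% of the shuffled keys become the test set, the rest the training set.\nn_test = round(len(keys) * test_ratio)\ntest_keys = set(shuffled[:n_test])\ntrain_keys = set(shuffled[n_test:])\n\nprint(len(train_keys), len(test_keys))  # 1202 301\n```",
"id": "32",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},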
{
"source": "The relation `split_params` specifies the exact splitting ratio between training and test sets. Note that the parameter name as well as the value need to be encoded as strings.\n\nAt this point, we can also add various checks to ensure that we have included all the instances from the original data set when we did the splitting in training and test. For example, we can check that the number of instances in training and test add up:",
"id": "11",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"name": "install_12",
"source": "ic all_data() {\n count[features_train] + count[features_test] = count[features]\n}",
"id": "12",
"type": "install",
"inputs": []
},
{
"source": "Or, we can more rigorously ensure that we have actually performed a split using all the available data:",
"id": "13",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"name": "install_14",
"source": "ic all_features() {\n equal(features, union[features_train, features_test])\n}",
"id": "14",
"type": "install",
"inputs": []
},
{
"source": "## Building a Linear Regression Model\n\nWe will be mostly using _glm_ bindings to create a [Linear Regression Model](https://juliastats.org/GLM.jl/stable/examples/#Linear-regression-1).\nTo this end, we will use [`glm_linear_regression`](https://docs.relational.ai/rel/ref/lib/ml#glm_linear_regression) from [The Machine Learning Library (ml)] which takes as input two relations, i.e., the features that we want to learn from and the response. We can train a glm linear regression model as follows:",
"id": "15",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"name": "install_16",
"source": "def glm_lr = glm_linear_regression[features_train, responses_train]",
"id": "16",
"type": "install",
"inputs": []
},
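{
"source": "As a reminder of what linear regression actually fits, here is a pure-Python sketch of ordinary least squares for a single feature, using the closed-form covariance/variance solution on made-up toy data. It illustrates only the underlying math, not the `glm` binding itself:\n\n```python\n# Toy data: y = 2*x + 1 exactly, so OLS should recover slope 2 and intercept 1.\nxs = [1.0, 2.0, 3.0, 4.0, 5.0]\nys = [3.0, 5.0, 7.0, 9.0, 11.0]\n\nn = len(xs)\nmean_x = sum(xs) / n\nmean_y = sum(ys) / n\n\n# slope = cov(x, y) / var(x); the intercept places the line through the means.\nnum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))\nden = sum((x - mean_x) ** 2 for x in xs)\nslope = num / den\nintercept = mean_y - slope * mean_x\n\nprint(slope, intercept)  # 2.0 1.0\n```",
"id": "33",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},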
{
"source": "Please note that _glm_ provides only unregularized linear regression.\nWe will also discuss a regularized model in this guide using [_mlpack_'s linear regression](https://www.mlpack.org/doc/stable/julia_documentation.html#linear_regression) functionality which supports the passing of a lambda (for ridge regression) as follows:",
"id": "17",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"name": "install_18",
"source": "def hyper_params = {(\"lambda\", \"0.1\")}\n\ndef mlpack_lr = mlpack_linear_regression[features_train, responses_train, hyper_params]",
"id": "18",
"type": "install",
"inputs": []
},
{
"source": "## Performing Predictions\n\nOnce we have our model ready, we can use it to perform predictions. We will use [`glm_predict`](https://docs.relational.ai/rel/ref/lib/ml#glm_predict) for our glm model, and [`mlpack_linear_regression_predict`](https://docs.relational.ai/rel/ref/lib/ml#mlpack_linear_regression_predict) for our mlpack model. In both cases, we will have to provide:\n\n1. the trained ML model,\n2. a relation with features similar to the one that was used for training, and\n3. the number of keys used in the feature relation.\n\nThe information about the number of keys is necessary so that [`glm_predict`](https://docs.relational.ai/rel/ref/lib/ml#glm_predict) and [`mlpack_linear_regression_predict`](https://docs.relational.ai/rel/ref/lib/ml#mlpack_linear_regression_predict) know how many keys are present in `features_train` and `features_test`. In our case, we have only one key, i.e., the csv row number, which we carried over from the data loading step.\n\nWe can now predict the sound (i.e. the response variable) using the training dataset:",
"id": "19",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"name": "install_20",
"source": "def glm_predictions_train = glm_predict[glm_lr, features_train, 1]\n\ndef mlpack_predictions_train = mlpack_linear_regression_predict[mlpack_lr, features_train, 1]",
"id": "20",
"type": "install",
"inputs": []
},
{
"source": "and we can, of course, also predict the sound for the unseen test dataset:",
"id": "21",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"name": "install_22",
"source": "def glm_predictions_test = glm_predict[glm_lr, features_test, 1]\n\ndef mlpack_predictions_test = mlpack_linear_regression_predict[mlpack_lr, features_test, 1]",
"id": "22",
"type": "install",
"inputs": []
},
{
"source": "Let's look at some predictions for the test dataset:",
"id": "23",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"source": "top[5, glm_predictions_test]",
"id": "24",
"type": "query",
"inputs": []
},
{
"source": "top[5, mlpack_predictions_test]",
"id": "25",
"type": "query",
"inputs": []
},
{
"source": "## Evaluating Our Model\n\nIn order to evaluate a linear model, one of the metrics that we can use is the $R^2$ value (or, the square of the correlation coefficient $R$). The $R^2$ is a value between 0 and 1 and attempts to measure the overall fit of a linear model by measuring the proportion of variance that is explained by the model over the observed data. Higher $R^2$ values are better because they indicate that more variance is explained by the model.\n\nThe $R^2$ is defined as follows:\n\n$$\\textrm{R}^2 = 1 - \\frac{\\sum_i (y_i - \\hat{y_i})^2}{\\sum_i (y_i - \\bar{y})^2}$$\n\nwhere $y_i$ is the expected value, $\\hat{y_i}$ is the predicted value, and $\\bar{y}$ is the mean of the expected values.\n\nIn addition to $R^2$, we can also use the Mean Square Error (MSE) or the Root Mean Square Error (RMSE), which attempt to capture the actual deviation of the predictions from the expected values:\n\n$$\\textrm{MSE} = \\frac{1}{N} \\sum_{i}^{N} (y_i - \\hat{y_i})^2$$\n\n$$\\textrm{RMSE} = \\sqrt{\\frac{1}{N} \\sum_{i}^{N} (y_i - \\hat{y_i})^2}$$\n\nAnd we can also use the Mean Absolute Error (MAE), which is a more direct representation of the deviation of the predictions from the expected values:\n\n$$\\textrm{MAE} = \\frac{1}{N} \\sum_{i}^{N} |y_i - \\hat{y_i}|$$\n\nWe now compute each of these metrics for the testing dataset and for our two models. We first provide the definitions for the different metrics:",
"id": "26",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"name": "install_27",
"source": "// R2\n@inline def R2[R, P] =\n sum[(R[pos] - P[pos])^2 for pos] / \n sum[(R[pos] - mean[R])^2 for pos]\n\n// MSE, RMSE\n@inline def MSE[R, P] = sum[(R[pos] - P[pos])^2 for pos] / count[R]\n@inline def RMSE[R, P] = sqrt[MSE[P, R]]\n\n// MAE\n@inline def MAE[R, P] = sum[abs[R[pos] - P[pos]] for pos] / count[R]",
"id": "27",
"type": "install",
"inputs": []
},
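{
"source": "The metric definitions above translate almost verbatim to plain Python. The sketch below evaluates them on a small, made-up pair of expected and predicted vectors, purely to illustrate the formulas:\n\n```python\nimport math\n\nexpected = [3.0, 5.0, 7.0, 9.0]\npredicted = [2.5, 5.0, 7.5, 9.0]\n\nn = len(expected)\nmean_e = sum(expected) / n\n\n# Residual and total sums of squares.\nss_res = sum((e - p) ** 2 for e, p in zip(expected, predicted))\nss_tot = sum((e - mean_e) ** 2 for e in expected)\n\nr2 = 1 - ss_res / ss_tot  # coefficient of determination\nmse = ss_res / n  # mean squared error\nrmse = math.sqrt(mse)  # root mean squared error\nmae = sum(abs(e - p) for e, p in zip(expected, predicted)) / n  # mean absolute error\n\nprint(r2, mse, rmse, mae)\n```",
"id": "34",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},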
{
"source": "And next, we compute them for the different models:",
"id": "28",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
},
{
"source": "def results:R2:glm = R2[responses_test, glm_predictions_test]\ndef results:R2:mlpack = R2[responses_test, mlpack_predictions_test]\n\ndef results:MSE:glm = MSE[responses_test, glm_predictions_test]\ndef results:MSE:mlpack = MSE[responses_test, mlpack_predictions_test]\n\ndef results:RMSE:glm = RMSE[responses_test, glm_predictions_test]\ndef results:RMSE:mlpack = RMSE[responses_test, mlpack_predictions_test]\n\ndef results:MAE:glm = MAE[responses_test, glm_predictions_test]\ndef results:MAE:mlpack = MAE[responses_test, mlpack_predictions_test]\n\ndef output = table[results]",
"id": "29",
"type": "query",
"inputs": []
},
{
"source": "## Summary\n\nWe demonstrated the use of a linear regression model on the `airfoil` dataset. More specifically, we used [`glm_linear_regression`](https://docs.relational.ai/rel/ref/lib/ml#glm_linear_regression) and [`mlpack_linear_regression`](https://docs.relational.ai/rel/ref/lib/ml#mlpack_linear_regression). In addition to [mlpack](https://www.mlpack.org/), and [glm](https://github.com/JuliaStats/GLM.jl) we also support [xgboost](https://xgboost.readthedocs.io/en/latest/) (with more coming).\n\nIt is important to note here that all of our machine learning models are specifically designed to have the same API.\nIn this way, we can easily swap machine learning models (of similar type, i.e., linear regression models).\nIn our example, we used both`glm_linear_regression` and `mlpack_linear_regression` and, in this way, we easily trained two different linear models (one non-regularized and one regularized).\n\nIn addition to the machine learning models, [The Machine Learning Library (ml)](https://docs.relational.ai/rel/ref/lib/ml) has useful functionality for other tasks as well.\nFor example, we can perform k-nearest-neighbor search on a relation through [`mlpack_knn`](https://docs.relational.ai/rel/ref/lib/ml#mlpack_knn) or perform dimensionality reduction through kernel principal component analysis (KPCA) in a given dataset through [`mlpack_kernel_pca`](https://docs.relational.ai/rel/ref/lib/ml#mlpack_kernel_pca).\n\nFor a complete list of machine learning models and related functionality please see [The Machine Learning Library](https://docs.relational.ai/rel/ref/lib/ml).",
"id": "30",
"isCodeFolded": true,
"type": "markdown",
"inputs": []
}
],
"metadata": {
"notebookFormatVersion": "0.0.1"
}
}