Machine Learning: Classification

This How-To Guide demonstrates how to load a dataset, build a classification model, and perform predictions using that model.

Goal

The goal of this how-to guide is to provide an introduction to Rel’s machine learning functionality. As one part of a larger series of machine learning how-to guides, this guide will focus on classification. Specifically, we will explore how to load a dataset, build a classification model, and perform predictions using that model.

Preliminaries

We recommend that you also go through the CSV Import Guide and JSON Import and Export Guide, since they contain examples and functionality useful to understand how to appropriately load different kinds of data into the system.

Dataset

For this how-to guide we will be using the Palmer Archipelago (Antarctica) penguin data. We will use a copy of the penguin dataset located in our public S3 bucket.

This is a multivariate dataset with instances of penguins together with their features. We will be using the penguins_size.csv file for our guide.

The dataset contains 344 instances of penguins from three species (classes): Adelie, Chinstrap, and Gentoo. The Adelie species contains 152 instances of penguins, Chinstrap has 68, and Gentoo has 124.

For each instance within the dataset, in addition to the species, there are 6 features:

Feature | Description | Type
island | The name of the island (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica) where the penguin was found and measured | Categorical
culmen_length_mm | The length of the penguin’s culmen in millimeters | Numerical
culmen_depth_mm | The depth of the penguin’s culmen in millimeters | Numerical
flipper_length_mm | The length of the penguin’s flippers in millimeters | Numerical
body_mass_g | The body mass of the penguin in grams | Numerical
sex | The sex (MALE, FEMALE) of the penguin | Categorical

Our goal in this guide is to build a classifier to predict the species of the penguin, given its features.

Here is a sample of the first five lines of the penguins_size.csv file that we will be working with:

species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
...

As you can see, there are certain instances of penguins where the data is not available (denoted by NA in the example above). To address this, we will perform some data cleaning over the loaded data, as discussed below.

Loading the Data

Let’s begin building a classifier by loading the data from the file containing the penguin data. We can load the file using load_csv as follows:

def config[:path] = "s3://relationalai-documentation-public/ml-classification/penguin/penguins_size.csv"

def config[:schema, :species] = "string"
def config[:schema, :island] = "string"
def config[:schema, :culmen_length_mm] = "float"
def config[:schema, :culmen_depth_mm] = "float"
def config[:schema, :flipper_length_mm] = "float"
def config[:schema, :body_mass_g] = "float"
def config[:schema, :sex] = "string"

// insert transaction
def insert[:penguin_raw] = lined_csv[load_csv[config]]

Note that, in the code above, we have specified the path to the file using an s3:// URL, which indicates a path to a public AWS S3 bucket.

Additionally, we read the header from the file and use the header names as our feature names. Finally, we specified the schema of the imported file: the first two and the last column (species, island, sex) are of type string, while the remaining columns (culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g) are of type float. In this guide, we will learn to predict the species feature.
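As a quick sanity check (our addition here, not a required step of the guide), we can count how many values were parsed for one of the columns:

// number of parsed values in the :species column
def output = count[penguin_raw:species]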

Cleaning the Data

As we discussed in the previous section, there are certain instances (or lines) in the dataset that we need to clean up. One such instance was shown earlier, where all the values were set to NA. As a first step, Rel has already cleaned these up for us: since it was not able to parse NA as a float, these instances were stored as load_errors inside the penguin_raw relation:

penguin_raw:load_errors

Relation: output

43"Adelie,Torgersen,NA,NA,NA,NA,NA"
3403"Gentoo,Biscoe,NA,NA,NA,NA,NA"

As we can see from the file positions, there were two such lines, with all of their features set to NA.
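If we want to count these error rows directly, a small query (again, our addition) does the job:

// number of lines that failed to parse
def output = count[penguin_raw:load_errors]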

In addition to those errors, there are also a few lines where sex is set to NA (8 in total) and one line where sex is set to ".". For the purposes of this guide, we will drop all rows with an issue, and we can get a clean dataset as follows:

def row_with_error(row) =
    penguin_raw:sex(row, "NA") or
    penguin_raw:sex(row, ".") or
    penguin_raw:load_errors(row, _, _)

def insert[:penguin] = column, row, entry... :
    penguin_raw(column, row, entry...) and
    not row_with_error(row)

The final penguin dataset looks as follows:

table[penguin]

Relation: output

row | body_mass_g | culmen_depth_mm | culmen_length_mm | flipper_length_mm | island | sex | species
1 | 3750.0 | 18.7 | 39.1 | 181.0 | "Torgersen" | "MALE" | "Adelie"
2 | 3800.0 | 17.4 | 39.5 | 186.0 | "Torgersen" | "FEMALE" | "Adelie"
3 | 3250.0 | 18.0 | 40.3 | 195.0 | "Torgersen" | "FEMALE" | "Adelie"
5 | 3450.0 | 19.3 | 36.7 | 193.0 | "Torgersen" | "FEMALE" | "Adelie"
6 | 3650.0 | 20.6 | 39.3 | 190.0 | "Torgersen" | "MALE" | "Adelie"
... (333 rows in total, ending at row 344; rows with errors, such as row 4, have been dropped)

Analyzing the Data

We can easily visualize the data we just loaded in different ways. For example, let’s take a look at the distribution of male and female penguins by species:

def output = vegalite:plot[
    vegalite:bar[
        :species,
        { :aggregate, "count" },
        { :data, penguin; :color, :sex; }
    ]
]
(Figure: Penguin sex by species)

Preparing the Data

Once we have the data loaded, we need to transform it so that we can feed it into the machine learning models.

In general, we support a variety of machine learning models. The complete list of supported models can be found in the Machine Learning Library.

Most of these models require two relations:

  • one containing the features to be used as inputs to train a model, and
  • one containing the response (or target) variable (or, class, in our case) that we want to learn to predict.

To this end, we put the feature data in the features relation and the class data (currently read as strings) in the response_string relation. Note that in the current implementation of the Machine Learning Library, the relation from which we extract the features (i.e., penguin) needs to be an extensional database (EDB) relation. This was ensured by using insert earlier when we defined the penguin relation.

def features = penguin[col]
    for col in {
        :island; :culmen_length_mm; :culmen_depth_mm;
        :flipper_length_mm; :body_mass_g; :sex
    }

def response_string = penguin:species

We can easily get statistics about our features data using describe:

table[describe[features]]

Relation: output

statistic | body_mass_g | culmen_depth_mm | culmen_length_mm | flipper_length_mm | island | sex
count | 333 | 333 | 333 | 333 | 333 | 333
max | 6300.0 | 21.5 | 59.6 | 231.0 | "Torgersen" | "MALE"
mean | 4207.057057057057 | 17.16486486486487 | 43.992792792792805 | 200.96696696696696 | |
min | 2700.0 | 13.1 | 32.1 | 172.0 | "Biscoe" | "FEMALE"
percentile25 | 3550.0 | 15.6 | 39.5 | 190.0 | |
percentile50 | 4050.0 | 17.3 | 44.5 | 197.0 | |
percentile75 | 4775.0 | 18.7 | 48.6 | 213.0 | |
std | 805.2158019428964 | 1.9692354633199007 | 5.468668342647561 | 14.015765288287879 | |
mode | | | | | "Biscoe" | "MALE"
mode_freq | | | | | 163 | 168
unique | | | | | 3 | 2

and, of course, we can do the same for our response_string data:

table[(:response, describe_full[response_string])]

Relation: output

statistic | response
count | 333
max | "Gentoo"
min | "Adelie"
mode | "Adelie"
mode_freq | 146
unique | 3

Here, we used describe_full because we have only one column in the response_string relation. Unlike describe, describe_full provides statistics for the overall set of data rather than per feature.

Converting Class Names to Integers

We will use an mlpack classifier, so we need to represent the response classes as integers. Strings or floats cannot be used to represent the classes.

To this end, we first identify all the unique classes. We can get them using last:

def classes = last[response_string]

Next, we add a number as an ID for each class. We can do this using sort, which sorts the classes so that we can use the ordering index as the class ID:

def id_class = sort[classes]
id_class

Relation: output

1"Adelie"
2"Chinstrap"
3"Gentoo"

In order to join with the relation response_string and get the ids, we need to swap the first and second columns. We can do this using transpose:

def class_id = transpose[id_class]

Note that transpose simply swaps the first and second columns, and is not to be confused with the typical matrix transposition. After we swap the columns, we can join with the response_string relation:

def response = response_string.class_id
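To make the effect of transpose concrete, here is a minimal sketch on a toy relation (toy_pairs is purely illustrative and not part of the guide):

// a toy binary relation
def toy_pairs = {(1, "a"); (2, "b")}

// transpose swaps the two columns: output is {("a", 1); ("b", 2)}
def output = transpose[toy_pairs]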

Of course, we could have done all this in one step as follows:

def response = response_string.(transpose[sort[last[response_string]]])

Creating Training and Test Datasets

In classification (as well as other machine learning approaches), we use a “training” dataset to learn a classification model and a “test” dataset to determine the accuracy of our model. In certain cases, we may also use a validation dataset for parameter tuning, but we will consider only training and test for the purposes of this how-to guide.

Because the penguin dataset is not already split into training and test sets, we will have to create these two datasets.

In the following, we split our data into training and test sets with a ratio of 80/20. We specify the splitting ratio and the seed in split_params. The splitting is done by mlpack_preprocess_split, which splits the keys in the two sets. Afterwards, we join them with the features and response so that we generate the corresponding training and test data sets:

def split_params = {("test_ratio", "0.2"); ("seed", "42")}

def data_key(:keys, k) = features(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_params]

def feature_train(f, k, v) = features(f, k, v) and data_key_split(1, k)
def feature_test(f, k, v) = features(f, k, v) and data_key_split(2, k)

def response_train(k, v) = response(k, v) and data_key_split(1, k)
def response_test(k, v) = response(k, v) and data_key_split(2, k)

The relation split_params specifies the exact splitting ratio between training and test sets. Note that the parameter name as well as the value need to be encoded as strings.
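For instance, to use a 70/30 split instead, we could define the parameters as follows (both values here are illustrative and would replace the definition above):

def split_params = {("test_ratio", "0.3"); ("seed", "123")}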

At this point, we can also add various checks to ensure that we have included all the instances from the original data set when we did the splitting in training and test. For example, we can check that the number of instances in training and test add up:

ic all_data() {
    count[feature_train] + count[feature_test] = count[features]
}

Or, we can more rigorously ensure that we have actually performed a split using all the available data:

ic all_features() {
    equal(features, union[feature_train, feature_test])
}
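Following the same pattern, we can sketch an analogous check for the response split (an additional constraint we add here for completeness; it is not part of the original guide):

ic all_responses() {
    equal(response, union[response_train, response_test])
}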

Building a Classifier

In this guide, we will use mlpack to create a decision tree classifier. The decision tree classifier of mlpack (like most of the other classifiers) accepts a set of optional parameters to tune the specific algorithm. The parameters for each classifier (aka hyper-parameters) are documented in the Machine Learning Library reference.

We set the hyper-parameters through a relation (we call it hyper_param here), as follows:

def hyper_param = {
    ("minimum_leaf_size", "10");
    ("minimum_gain_split", "1e-07")
}

Note that each classifier has its own parameters, which you can find in the Machine Learning Library reference. It is important to note that the parameters currently need to be passed as strings, as in the example above. We can also pass no parameters to the classifier, in which case the defaults are used. In our example, we specified that the minimum number of instances in a leaf should be 10, and we set the minimum gain for node splitting to 1e-07.

At this point, we are ready to build our classifier. We will use mlpack_decision_tree and specify the features for learning (i.e., the feature_train relation), the classes to learn to predict (i.e., the response_train relation), and the parameters:

def classifier = mlpack_decision_tree[
    feature_train,
    response_train,
    hyper_param
]

We now have a trained classifier: the relation classifier represents the model we have learned.

Performing Predictions

Our trained model classifier is now ready to make predictions. To make predictions, we have to use mlpack_decision_tree_predict, where we need to provide:

  1. the trained ML model,
  2. a relation with features similar to the one that was used for training, and
  3. a number that indicates the number of keys used in the feature relation.

The information about the number of keys is necessary because it defines the arity of the relation containing the features used to perform the predictions. In our case, we have only one key: the CSV file position, which we carried over from the data loading step.

We can predict the penguin species using the training dataset:

def prediction_train = mlpack_decision_tree_predict[
    classifier,
    feature_train,
    1
]

We can, of course, also predict the penguin species of the unseen test dataset:

def prediction_test = mlpack_decision_tree_predict[
    classifier,
    feature_test,
    1
]

Let’s look at some predictions for the test dataset:

top[5, prediction_test]

Relation: output

1 | 6 | 1
2 | 20 | 2
3 | 21 | 1
4 | 34 | 1
5 | 39 | 1

Evaluating Our Model

We can evaluate machine learning models using a variety of metrics. One popular metric is accuracy, which is defined as the fraction of correct predictions over the total number of predictions.

We can compute the accuracy of the classifier model on the training dataset as follows:

def train_accuracy =
    count[pos : prediction_train[pos] = response_train[pos]] /
    count[response_train]

train_accuracy

Relation: output

0.947565543071161

Of course, what we really care about is the performance of our model on the test dataset:

def test_accuracy =
    count[pos : prediction_test[pos] = response_test[pos]] /
    count[response_test]

test_accuracy

Relation: output

0.9545454545454546

We can also compute precision and recall (aka sensitivity) metrics for each class

def score_precision[c] =
    count[pos : prediction_test(pos, c) and response_test(pos, c)] /
    count[pos : prediction_test(pos, c)]

def score_recall[c] =
    count[pos : prediction_test(pos, c) and response_test(pos, c)] /
    count[pos : response_test(pos, c)]

and query them.

score_precision

Relation: output

1 | 0.9642857142857143
2 | 0.8181818181818182
3 | 1.0
score_recall

Relation: output

1 | 0.9642857142857143
2 | 0.9
3 | 0.9642857142857143

With precision and recall metrics at hand, we can also compute the F1 score for each class:

def score_f1[c] =
    2 * score_precision[c] * score_recall[c] /
    (score_precision[c] + score_recall[c])

We can then query them.

score_f1

Relation: output

1 | 0.9642857142857143
2 | 0.8571428571428572
3 | 0.9818181818181818

Finally, we can compute the full confusion matrix (where actual is the actual class, or response, and predicted is the predicted class):

def confusion_matrix[predicted, actual] = count[
    x : response_test(x, actual) and prediction_test(x, predicted)
]

When we query for it, we get:

confusion_matrix

Relation: output

1 | 1 | 27
1 | 2 | 1
2 | 1 | 1
2 | 2 | 9
2 | 3 | 1
3 | 3 | 27

Note that count does not return 0 for an empty relation, which means that if no data record of class actual was predicted to be of class predicted then this pair does not appear in confusion_matrix. This reflects the fundamental principle that, in Rel, missing data (or NULL in SQL) is not explicitly stored or represented.

To assign a zero count to these missing values, we simply need to explicitly define that for any missing predicted-actual class pair, (predicted, actual), we want to assign a count of 0. This is done below with the left_override (<++) operator:

table[
    confusion_matrix[class_column.class_id, class_row.class_id] <++ 0
    for class_column in classes,
        class_row in classes
]

Relation: output

actual \ predicted | Adelie | Chinstrap | Gentoo
Adelie | 27 | 1 | 0
Chinstrap | 1 | 9 | 0
Gentoo | 0 | 1 | 27

Here we also convert the integer class IDs back to their original class names and state that we want the relation to be displayed as a wide table.

Training Multiple Classifiers

With Rel we can easily train and test multiple classifiers. Let’s consider an example.

We will train a set of classifiers on the same train and test datasets as before, but we will use a different set of hyper-parameters for each classifier. We will use a relation called hyper_param within a module called fine_tune to keep all the different hyper-parameter configurations:

module fine_tune
    def hyper_param = {
        ("Classifier 1", {("minimum_leaf_size", "10"); ("minimum_gain_split", "1e-07")});
        ("Classifier 2", {("minimum_leaf_size", "20"); ("maximum_depth", "3")});
        ("Classifier 3", {("minimum_leaf_size", "5"); ("maximum_depth", "0")});
    }
end

In the hyper_param relation, we use a string key (i.e., "Classifier 1", "Classifier 2", "Classifier 3") to identify each hyper-parameter configuration. This key will also be useful for identifying the classifier trained from each configuration. We can now train multiple classifiers easily as follows:

module fine_tune
    def classifier[i] = mlpack_decision_tree[
        feature_train,
        response_train,
        hyper_param[i]
    ]
end

Note that the call to mlpack_decision_tree is the same as before, except that we iterate over all the hyper-parameter configurations in hyper_param.

We can now create predictions for each of the trained classifiers on the test set:

module fine_tune
    def prediction_test[i] = mlpack_decision_tree_predict[
        fine_tune:classifier[i],
        feature_test,
        1
    ]
end

And, as the next step, we can compute the precision for each classifier:

module fine_tune
    def score_precision(i, cl, score) =
        c = count[pos :
            fine_tune:prediction_test(i, pos, id) and
            response_test(pos, id)
        ]
        and n = count[pos : fine_tune:prediction_test(i, pos, id)]
        and score = c/n
        and id = class_id[cl]
        from c, n, id
end
def output = table[fine_tune:score_precision]

Relation: output

class | Classifier 1 | Classifier 2 | Classifier 3
Adelie | 0.9642857142857143 | 0.9642857142857143 | 0.9333333333333333
Chinstrap | 0.8181818181818182 | 0.8181818181818182 | 1.0
Gentoo | 1.0 | 1.0 | 1.0

Finally, we can plot the performance of each classifier over some specific metric. For example, we can show the precision of each classifier for each of the three classes as follows:

def precision_plot_data[:[], i] = {
    (:classifier_id, cid);
    (:class, cl);
    (:precision, pr)
}
from cid, cl, pr where sort[fine_tune:score_precision](i, cid, cl, pr)

def chart:data:values = precision_plot_data
def chart:mark = "bar"
def chart:width = 300
def chart = vegalite_utils:x[{
    (:field, "class");
}]

def chart = vegalite_utils:y[{
    (:field, "precision");
    (:type, "quantitative");
    (:axis, :format, ".3f");
}]

def chart:encoding:xOffset = { (:field, "classifier_id"); (:type, "nominal");}
def chart:encoding:color:field = "classifier_id"

def output = vegalite:plot[chart]
(Figure: Classifier precision)

Based on this analysis of the performance of multiple classifiers, Rel allows us to easily determine which classifier is expected to perform best. As an example, let’s pick the classifier with the maximum precision on the test set over all classes:

def score_precision_overall[i] =
    count[pos : fine_tune:prediction_test[i, pos] = response_test[pos]] /
    count[fine_tune:prediction_test[i]]

def max_precision_classifier_id = argmax[score_precision_overall]
def max_precision = score_precision_overall[max_precision_classifier_id]

def output:classifier = max_precision_classifier_id
def output:precision = max_precision

Relation: output

:classifier"Classifier 3"
:precision0.9696969696969697

Summary

We demonstrated the use of a decision tree classifier on our penguin dataset. More specifically, we used mlpack_decision_tree, i.e., a decision tree classifier from mlpack. We support other classifiers as well: in addition to mlpack, we also support other machine learning libraries, such as glm and xgboost, and we have more coming.

It is important to note here that all of our machine learning models are designed to have the same API. In this way, we can easily swap machine learning models of a similar type (i.e., classification models). In the example in this guide, we can simply switch mlpack_decision_tree to mlpack_random_forest, change hyper_param to the right parameters for mlpack_random_forest (or just leave it empty to use the defaults), and we now have a random forest classifier.
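For illustration, here is a minimal sketch of such a swap. It assumes that mlpack_random_forest takes the same (features, response, hyper-parameters) arguments and that its prediction counterpart follows the same naming pattern (mlpack_random_forest_predict); check the Machine Learning Library reference for the exact API:

// train a random forest; an empty hyper-parameter relation means defaults
def rf_classifier = mlpack_random_forest[
    feature_train,
    response_train,
    {}
]

// predict on the test set, with 1 key (the CSV file position), as before
def rf_prediction_test = mlpack_random_forest_predict[
    rf_classifier,
    feature_test,
    1
]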

In addition to the machine learning models, the Machine Learning Library also provides useful functionality for other tasks. For example, we can perform k-nearest-neighbor search on a relation through mlpack_knn, or perform dimensionality reduction on a given dataset through kernel principal component analysis (KPCA) with mlpack_kernel_pca.

For a complete list of machine learning models and related functionality, see the Machine Learning Library.