Machine Learning: Classification
This How-To Guide demonstrates how to load a dataset, build a classification model, and perform predictions using that model.
Goal
The goal of this how-to guide is to provide an introduction to Rel’s machine learning functionality. As one part of a larger series of machine learning how-to guides, this guide will focus on classification. Specifically, we will explore how to load a dataset, build a classification model, and perform predictions using that model.
Preliminaries
We recommend that you also go through the CSV Import Guide and the JSON Import and Export Guide, since they contain examples and functionality that are useful for understanding how to appropriately load different kinds of data into the system.
Dataset
For this how-to guide we will be using the Palmer Archipelago (Antarctica) penguin data. We will use a copy of the penguin dataset located in our public S3 bucket.
This is a multivariate dataset with instances of penguins together with their features. We will be using the penguins_size.csv file for our guide.
The dataset contains 344 instances of penguins from three species (classes): Adelie, Chinstrap, and Gentoo. The Adelie species contains 152 instances of penguins, Chinstrap has 68, and Gentoo has 124.
For each instance within the dataset, in addition to the species, there are 6 features:
Feature | Description | Type |
---|---|---|
island | The name of the island (Dream, Torgersen, or Biscoe) in the Palmer Archipelago (Antarctica) where the penguin was found and measured | Categorical |
culmen_length_mm | The length of the penguin’s culmen in millimeters | Numerical |
culmen_depth_mm | The depth of the penguin’s culmen in millimeters | Numerical |
flipper_length_mm | The length of the penguin’s flippers in millimeters | Numerical |
body_mass_g | The body mass of the penguin in grams | Numerical |
sex | The sex (MALE, FEMALE) of the penguin | Categorical |
Our goal in this guide is to build a classifier to predict the species of the penguin, given its features.
Here is a sample of the first five lines of the penguins_size.csv file that we will be working with:
species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
...
As you can see, there are certain instances of penguins where the data is not available (denoted by NA in the example above). To address this, we will perform some data cleaning over the loaded data, as we will discuss shortly.
Loading the Data
Let’s begin building our classifier by loading the penguin data. We can load the file using load_csv as follows:
def config[:path] = "s3://relationalai-documentation-public/ml-classification/penguin/penguins_size.csv"
def config[:schema, :species] = "string"
def config[:schema, :island] = "string"
def config[:schema, :culmen_length_mm] = "float"
def config[:schema, :culmen_depth_mm] = "float"
def config[:schema, :flipper_length_mm] = "float"
def config[:schema, :body_mass_g] = "float"
def config[:schema, :sex] = "string"
// insert transaction
def insert[:penguin_raw] = lined_csv[load_csv[config]]
Note that, in the code above, we have specified the path to the file, which is located in our public AWS S3 bucket. We used an s3:// URL, which indicates a path to a public AWS bucket.
Additionally, we are reading the header from the file, and we will use the header names as our feature names. Finally, we specified the schema of the imported file. Specifically, we indicated that the first two features and the last one (species, island, sex) are of type string, while the remaining ones (culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g) are of type float. In this guide, species is the feature we will learn to predict.
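For readers more familiar with Python, the schema-typed load above can be approximated with a short sketch. This is an illustrative analogue only (the helper names are ours, and Rel's load_csv behavior is only mimicked); it uses an inline copy of the sample rows shown earlier, and demonstrates how rows whose numeric columns fail to parse end up segregated, much like the load_errors discussed in the next section:

```python
import csv
import io

# Inline copy of the sample rows shown earlier in the guide.
SAMPLE = """species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
"""

# Schema matching the load_csv config: the measurement columns are floats.
FLOAT_COLS = {"culmen_length_mm", "culmen_depth_mm",
              "flipper_length_mm", "body_mass_g"}

def load_rows(text):
    """Parse rows; collect rows whose float columns fail to parse,
    loosely mirroring how unparseable rows land in load_errors."""
    rows, load_errors = [], []
    for i, rec in enumerate(csv.DictReader(io.StringIO(text))):
        try:
            for col in FLOAT_COLS:
                rec[col] = float(rec[col])
        except ValueError:
            load_errors.append((i, rec))
            continue
        rows.append(rec)
    return rows, load_errors

rows, load_errors = load_rows(SAMPLE)
print(len(rows), len(load_errors))  # 4 1 (the all-NA line fails to parse)
```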
Cleaning the Data
As we discussed in the previous section, there are certain instances (or lines) in the dataset that we need to clean up. One such instance was shown earlier, where all the values were set to NA. As a first step, Rel has already cleaned up these values for us: since it was not able to parse NA as a float, these instances were stored as load_errors inside the penguin_raw relation:
penguin_raw:load_errors
Relation: output
4 | 3 | "Adelie,Torgersen,NA,NA,NA,NA,NA" |
340 | 3 | "Gentoo,Biscoe,NA,NA,NA,NA,NA" |
As we can see from the file positions, there were two such lines in the dataset, each with all of its features set to NA.
In addition to those errors, there are also a few lines (8 in total) where sex is set to NA, and one line where sex is set to ".". For the purposes of this guide, we will drop all rows with such issues. We can get a clean dataset as follows:
def row_with_error(row) =
    penguin_raw:sex(row, "NA") or
    penguin_raw:sex(row, ".") or
    penguin_raw:load_errors(row, _, _)

def insert[:penguin] = column, row, entry... :
    penguin_raw(column, row, entry...) and not row_with_error(row)
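As a cross-check of the filtering rule (and not part of the guide's Rel code), the same row-dropping logic can be sketched in plain Python. The records here are hypothetical, trimmed-down rows; the markers "NA" and "." are the problem values identified above:

```python
# Hypothetical parsed rows; in the guide these come from penguin_raw.
rows = [
    {"species": "Adelie", "island": "Torgersen", "sex": "MALE"},
    {"species": "Adelie", "island": "Torgersen", "sex": "NA"},   # dropped
    {"species": "Gentoo", "island": "Biscoe",    "sex": "."},    # dropped
    {"species": "Gentoo", "island": "Biscoe",    "sex": "FEMALE"},
]

def row_with_error(row):
    # Mirrors the Rel definition: a sex value of "NA" or "." marks a bad row.
    return row["sex"] in ("NA", ".")

penguin = [r for r in rows if not row_with_error(r)]
print([r["species"] for r in penguin])  # ['Adelie', 'Gentoo']
```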
The final penguin dataset looks as follows:
table[penguin]
Relation: output
body_mass_g | culmen_depth_mm | culmen_length_mm | flipper_length_mm | island | sex | species | |
---|---|---|---|---|---|---|---|
1 | 3750.0 | 18.7 | 39.1 | 181.0 | "Torgersen" | "MALE" | "Adelie" |
2 | 3800.0 | 17.4 | 39.5 | 186.0 | "Torgersen" | "FEMALE" | "Adelie" |
3 | 3250.0 | 18.0 | 40.3 | 195.0 | "Torgersen" | "FEMALE" | "Adelie" |
5 | 3450.0 | 19.3 | 36.7 | 193.0 | "Torgersen" | "FEMALE" | "Adelie" |
6 | 3650.0 | 20.6 | 39.3 | 190.0 | "Torgersen" | "MALE" | "Adelie" |
7 | 3625.0 | 17.8 | 38.9 | 181.0 | "Torgersen" | "FEMALE" | "Adelie" |
8 | 4675.0 | 19.6 | 39.2 | 195.0 | "Torgersen" | "MALE" | "Adelie" |
13 | 3200.0 | 17.6 | 41.1 | 182.0 | "Torgersen" | "FEMALE" | "Adelie" |
14 | 3800.0 | 21.2 | 38.6 | 191.0 | "Torgersen" | "MALE" | "Adelie" |
15 | 4400.0 | 21.1 | 34.6 | 198.0 | "Torgersen" | "MALE" | "Adelie" |
16 | 3700.0 | 17.8 | 36.6 | 185.0 | "Torgersen" | "FEMALE" | "Adelie" |
17 | 3450.0 | 19.0 | 38.7 | 195.0 | "Torgersen" | "FEMALE" | "Adelie" |
18 | 4500.0 | 20.7 | 42.5 | 197.0 | "Torgersen" | "MALE" | "Adelie" |
19 | 3325.0 | 18.4 | 34.4 | 184.0 | "Torgersen" | "FEMALE" | "Adelie" |
20 | 4200.0 | 21.5 | 46.0 | 194.0 | "Torgersen" | "MALE" | "Adelie" |
21 | 3400.0 | 18.3 | 37.8 | 174.0 | "Biscoe" | "FEMALE" | "Adelie" |
22 | 3600.0 | 18.7 | 37.7 | 180.0 | "Biscoe" | "MALE" | "Adelie" |
23 | 3800.0 | 19.2 | 35.9 | 189.0 | "Biscoe" | "FEMALE" | "Adelie" |
24 | 3950.0 | 18.1 | 38.2 | 185.0 | "Biscoe" | "MALE" | "Adelie" |
25 | 3800.0 | 17.2 | 38.8 | 180.0 | "Biscoe" | "MALE" | "Adelie" |
26 | 3800.0 | 18.9 | 35.3 | 187.0 | "Biscoe" | "FEMALE" | "Adelie" |
27 | 3550.0 | 18.6 | 40.6 | 183.0 | "Biscoe" | "MALE" | "Adelie" |
28 | 3200.0 | 17.9 | 40.5 | 187.0 | "Biscoe" | "FEMALE" | "Adelie" |
29 | 3150.0 | 18.6 | 37.9 | 172.0 | "Biscoe" | "FEMALE" | "Adelie" |
30 | 3950.0 | 18.9 | 40.5 | 180.0 | "Biscoe" | "MALE" | "Adelie" |
31 | 3250.0 | 16.7 | 39.5 | 178.0 | "Dream" | "FEMALE" | "Adelie" |
32 | 3900.0 | 18.1 | 37.2 | 178.0 | "Dream" | "MALE" | "Adelie" |
33 | 3300.0 | 17.8 | 39.5 | 188.0 | "Dream" | "FEMALE" | "Adelie" |
34 | 3900.0 | 18.9 | 40.9 | 184.0 | "Dream" | "MALE" | "Adelie" |
35 | 3325.0 | 17.0 | 36.4 | 195.0 | "Dream" | "FEMALE" | "Adelie" |
36 | 4150.0 | 21.1 | 39.2 | 196.0 | "Dream" | "MALE" | "Adelie" |
37 | 3950.0 | 20.0 | 38.8 | 190.0 | "Dream" | "MALE" | "Adelie" |
38 | 3550.0 | 18.5 | 42.2 | 180.0 | "Dream" | "FEMALE" | "Adelie" |
39 | 3300.0 | 19.3 | 37.6 | 181.0 | "Dream" | "FEMALE" | "Adelie" |
40 | 4650.0 | 19.1 | 39.8 | 184.0 | "Dream" | "MALE" | "Adelie" |
41 | 3150.0 | 18.0 | 36.5 | 182.0 | "Dream" | "FEMALE" | "Adelie" |
42 | 3900.0 | 18.4 | 40.8 | 195.0 | "Dream" | "MALE" | "Adelie" |
43 | 3100.0 | 18.5 | 36.0 | 186.0 | "Dream" | "FEMALE" | "Adelie" |
44 | 4400.0 | 19.7 | 44.1 | 196.0 | "Dream" | "MALE" | "Adelie" |
45 | 3000.0 | 16.9 | 37.0 | 185.0 | "Dream" | "FEMALE" | "Adelie" |
46 | 4600.0 | 18.8 | 39.6 | 190.0 | "Dream" | "MALE" | "Adelie" |
47 | 3425.0 | 19.0 | 41.1 | 182.0 | "Dream" | "MALE" | "Adelie" |
49 | 3450.0 | 17.9 | 36.0 | 190.0 | "Dream" | "FEMALE" | "Adelie" |
50 | 4150.0 | 21.2 | 42.3 | 191.0 | "Dream" | "MALE" | "Adelie" |
51 | 3500.0 | 17.7 | 39.6 | 186.0 | "Biscoe" | "FEMALE" | "Adelie" |
52 | 4300.0 | 18.9 | 40.1 | 188.0 | "Biscoe" | "MALE" | "Adelie" |
53 | 3450.0 | 17.9 | 35.0 | 190.0 | "Biscoe" | "FEMALE" | "Adelie" |
54 | 4050.0 | 19.5 | 42.0 | 200.0 | "Biscoe" | "MALE" | "Adelie" |
55 | 2900.0 | 18.1 | 34.5 | 187.0 | "Biscoe" | "FEMALE" | "Adelie" |
56 | 3700.0 | 18.6 | 41.4 | 191.0 | "Biscoe" | "MALE" | "Adelie" |
57 | 3550.0 | 17.5 | 39.0 | 186.0 | "Biscoe" | "FEMALE" | "Adelie" |
58 | 3800.0 | 18.8 | 40.6 | 193.0 | "Biscoe" | "MALE" | "Adelie" |
59 | 2850.0 | 16.6 | 36.5 | 181.0 | "Biscoe" | "FEMALE" | "Adelie" |
60 | 3750.0 | 19.1 | 37.6 | 194.0 | "Biscoe" | "MALE" | "Adelie" |
61 | 3150.0 | 16.9 | 35.7 | 185.0 | "Biscoe" | "FEMALE" | "Adelie" |
62 | 4400.0 | 21.1 | 41.3 | 195.0 | "Biscoe" | "MALE" | "Adelie" |
63 | 3600.0 | 17.0 | 37.6 | 185.0 | "Biscoe" | "FEMALE" | "Adelie" |
64 | 4050.0 | 18.2 | 41.1 | 192.0 | "Biscoe" | "MALE" | "Adelie" |
65 | 2850.0 | 17.1 | 36.4 | 184.0 | "Biscoe" | "FEMALE" | "Adelie" |
66 | 3950.0 | 18.0 | 41.6 | 192.0 | "Biscoe" | "MALE" | "Adelie" |
67 | 3350.0 | 16.2 | 35.5 | 195.0 | "Biscoe" | "FEMALE" | "Adelie" |
68 | 4100.0 | 19.1 | 41.1 | 188.0 | "Biscoe" | "MALE" | "Adelie" |
69 | 3050.0 | 16.6 | 35.9 | 190.0 | "Torgersen" | "FEMALE" | "Adelie" |
70 | 4450.0 | 19.4 | 41.8 | 198.0 | "Torgersen" | "MALE" | "Adelie" |
71 | 3600.0 | 19.0 | 33.5 | 190.0 | "Torgersen" | "FEMALE" | "Adelie" |
72 | 3900.0 | 18.4 | 39.7 | 190.0 | "Torgersen" | "MALE" | "Adelie" |
73 | 3550.0 | 17.2 | 39.6 | 196.0 | "Torgersen" | "FEMALE" | "Adelie" |
74 | 4150.0 | 18.9 | 45.8 | 197.0 | "Torgersen" | "MALE" | "Adelie" |
75 | 3700.0 | 17.5 | 35.5 | 190.0 | "Torgersen" | "FEMALE" | "Adelie" |
76 | 4250.0 | 18.5 | 42.8 | 195.0 | "Torgersen" | "MALE" | "Adelie" |
77 | 3700.0 | 16.8 | 40.9 | 191.0 | "Torgersen" | "FEMALE" | "Adelie" |
78 | 3900.0 | 19.4 | 37.2 | 184.0 | "Torgersen" | "MALE" | "Adelie" |
79 | 3550.0 | 16.1 | 36.2 | 187.0 | "Torgersen" | "FEMALE" | "Adelie" |
80 | 4000.0 | 19.1 | 42.1 | 195.0 | "Torgersen" | "MALE" | "Adelie" |
81 | 3200.0 | 17.2 | 34.6 | 189.0 | "Torgersen" | "FEMALE" | "Adelie" |
82 | 4700.0 | 17.6 | 42.9 | 196.0 | "Torgersen" | "MALE" | "Adelie" |
83 | 3800.0 | 18.8 | 36.7 | 187.0 | "Torgersen" | "FEMALE" | "Adelie" |
84 | 4200.0 | 19.4 | 35.1 | 193.0 | "Torgersen" | "MALE" | "Adelie" |
85 | 3350.0 | 17.8 | 37.3 | 191.0 | "Dream" | "FEMALE" | "Adelie" |
86 | 3550.0 | 20.3 | 41.3 | 194.0 | "Dream" | "MALE" | "Adelie" |
87 | 3800.0 | 19.5 | 36.3 | 190.0 | "Dream" | "MALE" | "Adelie" |
88 | 3500.0 | 18.6 | 36.9 | 189.0 | "Dream" | "FEMALE" | "Adelie" |
89 | 3950.0 | 19.2 | 38.3 | 189.0 | "Dream" | "MALE" | "Adelie" |
90 | 3600.0 | 18.8 | 38.9 | 190.0 | "Dream" | "FEMALE" | "Adelie" |
91 | 3550.0 | 18.0 | 35.7 | 202.0 | "Dream" | "FEMALE" | "Adelie" |
92 | 4300.0 | 18.1 | 41.1 | 205.0 | "Dream" | "MALE" | "Adelie" |
93 | 3400.0 | 17.1 | 34.0 | 185.0 | "Dream" | "FEMALE" | "Adelie" |
94 | 4450.0 | 18.1 | 39.6 | 186.0 | "Dream" | "MALE" | "Adelie" |
95 | 3300.0 | 17.3 | 36.2 | 187.0 | "Dream" | "FEMALE" | "Adelie" |
96 | 4300.0 | 18.9 | 40.8 | 208.0 | "Dream" | "MALE" | "Adelie" |
97 | 3700.0 | 18.6 | 38.1 | 190.0 | "Dream" | "FEMALE" | "Adelie" |
98 | 4350.0 | 18.5 | 40.3 | 196.0 | "Dream" | "MALE" | "Adelie" |
99 | 2900.0 | 16.1 | 33.1 | 178.0 | "Dream" | "FEMALE" | "Adelie" |
100 | 4100.0 | 18.5 | 43.2 | 192.0 | "Dream" | "MALE" | "Adelie" |
101 | 3725.0 | 17.9 | 35.0 | 192.0 | "Biscoe" | "FEMALE" | "Adelie" |
102 | 4725.0 | 20.0 | 41.0 | 203.0 | "Biscoe" | "MALE" | "Adelie" |
103 | 3075.0 | 16.0 | 37.7 | 183.0 | "Biscoe" | "FEMALE" | "Adelie" |
104 | 4250.0 | 20.0 | 37.8 | 190.0 | "Biscoe" | "MALE" | "Adelie" |
105 | 2925.0 | 18.6 | 37.9 | 193.0 | "Biscoe" | "FEMALE" | "Adelie" |
106 | 3550.0 | 18.9 | 39.7 | 184.0 | "Biscoe" | "MALE" | "Adelie" |
107 | 3750.0 | 17.2 | 38.6 | 199.0 | "Biscoe" | "FEMALE" | "Adelie" |
108 | 3900.0 | 20.0 | 38.2 | 190.0 | "Biscoe" | "MALE" | "Adelie" |
109 | 3175.0 | 17.0 | 38.1 | 181.0 | "Biscoe" | "FEMALE" | "Adelie" |
110 | 4775.0 | 19.0 | 43.2 | 197.0 | "Biscoe" | "MALE" | "Adelie" |
111 | 3825.0 | 16.5 | 38.1 | 198.0 | "Biscoe" | "FEMALE" | "Adelie" |
112 | 4600.0 | 20.3 | 45.6 | 191.0 | "Biscoe" | "MALE" | "Adelie" |
113 | 3200.0 | 17.7 | 39.7 | 193.0 | "Biscoe" | "FEMALE" | "Adelie" |
114 | 4275.0 | 19.5 | 42.2 | 197.0 | "Biscoe" | "MALE" | "Adelie" |
115 | 3900.0 | 20.7 | 39.6 | 191.0 | "Biscoe" | "FEMALE" | "Adelie" |
116 | 4075.0 | 18.3 | 42.7 | 196.0 | "Biscoe" | "MALE" | "Adelie" |
117 | 2900.0 | 17.0 | 38.6 | 188.0 | "Torgersen" | "FEMALE" | "Adelie" |
118 | 3775.0 | 20.5 | 37.3 | 199.0 | "Torgersen" | "MALE" | "Adelie" |
119 | 3350.0 | 17.0 | 35.7 | 189.0 | "Torgersen" | "FEMALE" | "Adelie" |
120 | 3325.0 | 18.6 | 41.1 | 189.0 | "Torgersen" | "MALE" | "Adelie" |
121 | 3150.0 | 17.2 | 36.2 | 187.0 | "Torgersen" | "FEMALE" | "Adelie" |
122 | 3500.0 | 19.8 | 37.7 | 198.0 | "Torgersen" | "MALE" | "Adelie" |
123 | 3450.0 | 17.0 | 40.2 | 176.0 | "Torgersen" | "FEMALE" | "Adelie" |
124 | 3875.0 | 18.5 | 41.4 | 202.0 | "Torgersen" | "MALE" | "Adelie" |
125 | 3050.0 | 15.9 | 35.2 | 186.0 | "Torgersen" | "FEMALE" | "Adelie" |
126 | 4000.0 | 19.0 | 40.6 | 199.0 | "Torgersen" | "MALE" | "Adelie" |
127 | 3275.0 | 17.6 | 38.8 | 191.0 | "Torgersen" | "FEMALE" | "Adelie" |
128 | 4300.0 | 18.3 | 41.5 | 195.0 | "Torgersen" | "MALE" | "Adelie" |
129 | 3050.0 | 17.1 | 39.0 | 191.0 | "Torgersen" | "FEMALE" | "Adelie" |
130 | 4000.0 | 18.0 | 44.1 | 210.0 | "Torgersen" | "MALE" | "Adelie" |
131 | 3325.0 | 17.9 | 38.5 | 190.0 | "Torgersen" | "FEMALE" | "Adelie" |
132 | 3500.0 | 19.2 | 43.1 | 197.0 | "Torgersen" | "MALE" | "Adelie" |
133 | 3500.0 | 18.5 | 36.8 | 193.0 | "Dream" | "FEMALE" | "Adelie" |
134 | 4475.0 | 18.5 | 37.5 | 199.0 | "Dream" | "MALE" | "Adelie" |
135 | 3425.0 | 17.6 | 38.1 | 187.0 | "Dream" | "FEMALE" | "Adelie" |
136 | 3900.0 | 17.5 | 41.1 | 190.0 | "Dream" | "MALE" | "Adelie" |
137 | 3175.0 | 17.5 | 35.6 | 191.0 | "Dream" | "FEMALE" | "Adelie" |
138 | 3975.0 | 20.1 | 40.2 | 200.0 | "Dream" | "MALE" | "Adelie" |
139 | 3400.0 | 16.5 | 37.0 | 185.0 | "Dream" | "FEMALE" | "Adelie" |
140 | 4250.0 | 17.9 | 39.7 | 193.0 | "Dream" | "MALE" | "Adelie" |
141 | 3400.0 | 17.1 | 40.2 | 193.0 | "Dream" | "FEMALE" | "Adelie" |
142 | 3475.0 | 17.2 | 40.6 | 187.0 | "Dream" | "MALE" | "Adelie" |
143 | 3050.0 | 15.5 | 32.1 | 188.0 | "Dream" | "FEMALE" | "Adelie" |
144 | 3725.0 | 17.0 | 40.7 | 190.0 | "Dream" | "MALE" | "Adelie" |
145 | 3000.0 | 16.8 | 37.3 | 192.0 | "Dream" | "FEMALE" | "Adelie" |
146 | 3650.0 | 18.7 | 39.0 | 185.0 | "Dream" | "MALE" | "Adelie" |
147 | 4250.0 | 18.6 | 39.2 | 190.0 | "Dream" | "MALE" | "Adelie" |
148 | 3475.0 | 18.4 | 36.6 | 184.0 | "Dream" | "FEMALE" | "Adelie" |
149 | 3450.0 | 17.8 | 36.0 | 195.0 | "Dream" | "FEMALE" | "Adelie" |
150 | 3750.0 | 18.1 | 37.8 | 193.0 | "Dream" | "MALE" | "Adelie" |
151 | 3700.0 | 17.1 | 36.0 | 187.0 | "Dream" | "FEMALE" | "Adelie" |
152 | 4000.0 | 18.5 | 41.5 | 201.0 | "Dream" | "MALE" | "Adelie" |
153 | 3500.0 | 17.9 | 46.5 | 192.0 | "Dream" | "FEMALE" | "Chinstrap" |
154 | 3900.0 | 19.5 | 50.0 | 196.0 | "Dream" | "MALE" | "Chinstrap" |
155 | 3650.0 | 19.2 | 51.3 | 193.0 | "Dream" | "MALE" | "Chinstrap" |
156 | 3525.0 | 18.7 | 45.4 | 188.0 | "Dream" | "FEMALE" | "Chinstrap" |
157 | 3725.0 | 19.8 | 52.7 | 197.0 | "Dream" | "MALE" | "Chinstrap" |
158 | 3950.0 | 17.8 | 45.2 | 198.0 | "Dream" | "FEMALE" | "Chinstrap" |
159 | 3250.0 | 18.2 | 46.1 | 178.0 | "Dream" | "FEMALE" | "Chinstrap" |
160 | 3750.0 | 18.2 | 51.3 | 197.0 | "Dream" | "MALE" | "Chinstrap" |
161 | 4150.0 | 18.9 | 46.0 | 195.0 | "Dream" | "FEMALE" | "Chinstrap" |
162 | 3700.0 | 19.9 | 51.3 | 198.0 | "Dream" | "MALE" | "Chinstrap" |
163 | 3800.0 | 17.8 | 46.6 | 193.0 | "Dream" | "FEMALE" | "Chinstrap" |
164 | 3775.0 | 20.3 | 51.7 | 194.0 | "Dream" | "MALE" | "Chinstrap" |
165 | 3700.0 | 17.3 | 47.0 | 185.0 | "Dream" | "FEMALE" | "Chinstrap" |
166 | 4050.0 | 18.1 | 52.0 | 201.0 | "Dream" | "MALE" | "Chinstrap" |
167 | 3575.0 | 17.1 | 45.9 | 190.0 | "Dream" | "FEMALE" | "Chinstrap" |
168 | 4050.0 | 19.6 | 50.5 | 201.0 | "Dream" | "MALE" | "Chinstrap" |
169 | 3300.0 | 20.0 | 50.3 | 197.0 | "Dream" | "MALE" | "Chinstrap" |
170 | 3700.0 | 17.8 | 58.0 | 181.0 | "Dream" | "FEMALE" | "Chinstrap" |
171 | 3450.0 | 18.6 | 46.4 | 190.0 | "Dream" | "FEMALE" | "Chinstrap" |
172 | 4400.0 | 18.2 | 49.2 | 195.0 | "Dream" | "MALE" | "Chinstrap" |
173 | 3600.0 | 17.3 | 42.4 | 181.0 | "Dream" | "FEMALE" | "Chinstrap" |
174 | 3400.0 | 17.5 | 48.5 | 191.0 | "Dream" | "MALE" | "Chinstrap" |
175 | 2900.0 | 16.6 | 43.2 | 187.0 | "Dream" | "FEMALE" | "Chinstrap" |
176 | 3800.0 | 19.4 | 50.6 | 193.0 | "Dream" | "MALE" | "Chinstrap" |
177 | 3300.0 | 17.9 | 46.7 | 195.0 | "Dream" | "FEMALE" | "Chinstrap" |
178 | 4150.0 | 19.0 | 52.0 | 197.0 | "Dream" | "MALE" | "Chinstrap" |
179 | 3400.0 | 18.4 | 50.5 | 200.0 | "Dream" | "FEMALE" | "Chinstrap" |
180 | 3800.0 | 19.0 | 49.5 | 200.0 | "Dream" | "MALE" | "Chinstrap" |
181 | 3700.0 | 17.8 | 46.4 | 191.0 | "Dream" | "FEMALE" | "Chinstrap" |
182 | 4550.0 | 20.0 | 52.8 | 205.0 | "Dream" | "MALE" | "Chinstrap" |
183 | 3200.0 | 16.6 | 40.9 | 187.0 | "Dream" | "FEMALE" | "Chinstrap" |
184 | 4300.0 | 20.8 | 54.2 | 201.0 | "Dream" | "MALE" | "Chinstrap" |
185 | 3350.0 | 16.7 | 42.5 | 187.0 | "Dream" | "FEMALE" | "Chinstrap" |
186 | 4100.0 | 18.8 | 51.0 | 203.0 | "Dream" | "MALE" | "Chinstrap" |
187 | 3600.0 | 18.6 | 49.7 | 195.0 | "Dream" | "MALE" | "Chinstrap" |
188 | 3900.0 | 16.8 | 47.5 | 199.0 | "Dream" | "FEMALE" | "Chinstrap" |
189 | 3850.0 | 18.3 | 47.6 | 195.0 | "Dream" | "FEMALE" | "Chinstrap" |
190 | 4800.0 | 20.7 | 52.0 | 210.0 | "Dream" | "MALE" | "Chinstrap" |
191 | 2700.0 | 16.6 | 46.9 | 192.0 | "Dream" | "FEMALE" | "Chinstrap" |
192 | 4500.0 | 19.9 | 53.5 | 205.0 | "Dream" | "MALE" | "Chinstrap" |
193 | 3950.0 | 19.5 | 49.0 | 210.0 | "Dream" | "MALE" | "Chinstrap" |
194 | 3650.0 | 17.5 | 46.2 | 187.0 | "Dream" | "FEMALE" | "Chinstrap" |
195 | 3550.0 | 19.1 | 50.9 | 196.0 | "Dream" | "MALE" | "Chinstrap" |
196 | 3500.0 | 17.0 | 45.5 | 196.0 | "Dream" | "FEMALE" | "Chinstrap" |
197 | 3675.0 | 17.9 | 50.9 | 196.0 | "Dream" | "FEMALE" | "Chinstrap" |
198 | 4450.0 | 18.5 | 50.8 | 201.0 | "Dream" | "MALE" | "Chinstrap" |
199 | 3400.0 | 17.9 | 50.1 | 190.0 | "Dream" | "FEMALE" | "Chinstrap" |
200 | 4300.0 | 19.6 | 49.0 | 212.0 | "Dream" | "MALE" | "Chinstrap" |
201 | 3250.0 | 18.7 | 51.5 | 187.0 | "Dream" | "MALE" | "Chinstrap" |
202 | 3675.0 | 17.3 | 49.8 | 198.0 | "Dream" | "FEMALE" | "Chinstrap" |
203 | 3325.0 | 16.4 | 48.1 | 199.0 | "Dream" | "FEMALE" | "Chinstrap" |
204 | 3950.0 | 19.0 | 51.4 | 201.0 | "Dream" | "MALE" | "Chinstrap" |
205 | 3600.0 | 17.3 | 45.7 | 193.0 | "Dream" | "FEMALE" | "Chinstrap" |
206 | 4050.0 | 19.7 | 50.7 | 203.0 | "Dream" | "MALE" | "Chinstrap" |
207 | 3350.0 | 17.3 | 42.5 | 187.0 | "Dream" | "FEMALE" | "Chinstrap" |
208 | 3450.0 | 18.8 | 52.2 | 197.0 | "Dream" | "MALE" | "Chinstrap" |
209 | 3250.0 | 16.6 | 45.2 | 191.0 | "Dream" | "FEMALE" | "Chinstrap" |
210 | 4050.0 | 19.9 | 49.3 | 203.0 | "Dream" | "MALE" | "Chinstrap" |
211 | 3800.0 | 18.8 | 50.2 | 202.0 | "Dream" | "MALE" | "Chinstrap" |
212 | 3525.0 | 19.4 | 45.6 | 194.0 | "Dream" | "FEMALE" | "Chinstrap" |
213 | 3950.0 | 19.5 | 51.9 | 206.0 | "Dream" | "MALE" | "Chinstrap" |
214 | 3650.0 | 16.5 | 46.8 | 189.0 | "Dream" | "FEMALE" | "Chinstrap" |
215 | 3650.0 | 17.0 | 45.7 | 195.0 | "Dream" | "FEMALE" | "Chinstrap" |
216 | 4000.0 | 19.8 | 55.8 | 207.0 | "Dream" | "MALE" | "Chinstrap" |
217 | 3400.0 | 18.1 | 43.5 | 202.0 | "Dream" | "FEMALE" | "Chinstrap" |
218 | 3775.0 | 18.2 | 49.6 | 193.0 | "Dream" | "MALE" | "Chinstrap" |
219 | 4100.0 | 19.0 | 50.8 | 210.0 | "Dream" | "MALE" | "Chinstrap" |
220 | 3775.0 | 18.7 | 50.2 | 198.0 | "Dream" | "FEMALE" | "Chinstrap" |
221 | 4500.0 | 13.2 | 46.1 | 211.0 | "Biscoe" | "FEMALE" | "Gentoo" |
222 | 5700.0 | 16.3 | 50.0 | 230.0 | "Biscoe" | "MALE" | "Gentoo" |
223 | 4450.0 | 14.1 | 48.7 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
224 | 5700.0 | 15.2 | 50.0 | 218.0 | "Biscoe" | "MALE" | "Gentoo" |
225 | 5400.0 | 14.5 | 47.6 | 215.0 | "Biscoe" | "MALE" | "Gentoo" |
226 | 4550.0 | 13.5 | 46.5 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
227 | 4800.0 | 14.6 | 45.4 | 211.0 | "Biscoe" | "FEMALE" | "Gentoo" |
228 | 5200.0 | 15.3 | 46.7 | 219.0 | "Biscoe" | "MALE" | "Gentoo" |
229 | 4400.0 | 13.4 | 43.3 | 209.0 | "Biscoe" | "FEMALE" | "Gentoo" |
230 | 5150.0 | 15.4 | 46.8 | 215.0 | "Biscoe" | "MALE" | "Gentoo" |
231 | 4650.0 | 13.7 | 40.9 | 214.0 | "Biscoe" | "FEMALE" | "Gentoo" |
232 | 5550.0 | 16.1 | 49.0 | 216.0 | "Biscoe" | "MALE" | "Gentoo" |
233 | 4650.0 | 13.7 | 45.5 | 214.0 | "Biscoe" | "FEMALE" | "Gentoo" |
234 | 5850.0 | 14.6 | 48.4 | 213.0 | "Biscoe" | "MALE" | "Gentoo" |
235 | 4200.0 | 14.6 | 45.8 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
236 | 5850.0 | 15.7 | 49.3 | 217.0 | "Biscoe" | "MALE" | "Gentoo" |
237 | 4150.0 | 13.5 | 42.0 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
238 | 6300.0 | 15.2 | 49.2 | 221.0 | "Biscoe" | "MALE" | "Gentoo" |
239 | 4800.0 | 14.5 | 46.2 | 209.0 | "Biscoe" | "FEMALE" | "Gentoo" |
240 | 5350.0 | 15.1 | 48.7 | 222.0 | "Biscoe" | "MALE" | "Gentoo" |
241 | 5700.0 | 14.3 | 50.2 | 218.0 | "Biscoe" | "MALE" | "Gentoo" |
242 | 5000.0 | 14.5 | 45.1 | 215.0 | "Biscoe" | "FEMALE" | "Gentoo" |
243 | 4400.0 | 14.5 | 46.5 | 213.0 | "Biscoe" | "FEMALE" | "Gentoo" |
244 | 5050.0 | 15.8 | 46.3 | 215.0 | "Biscoe" | "MALE" | "Gentoo" |
245 | 5000.0 | 13.1 | 42.9 | 215.0 | "Biscoe" | "FEMALE" | "Gentoo" |
246 | 5100.0 | 15.1 | 46.1 | 215.0 | "Biscoe" | "MALE" | "Gentoo" |
248 | 5650.0 | 15.0 | 47.8 | 215.0 | "Biscoe" | "MALE" | "Gentoo" |
249 | 4600.0 | 14.3 | 48.2 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
250 | 5550.0 | 15.3 | 50.0 | 220.0 | "Biscoe" | "MALE" | "Gentoo" |
251 | 5250.0 | 15.3 | 47.3 | 222.0 | "Biscoe" | "MALE" | "Gentoo" |
252 | 4700.0 | 14.2 | 42.8 | 209.0 | "Biscoe" | "FEMALE" | "Gentoo" |
253 | 5050.0 | 14.5 | 45.1 | 207.0 | "Biscoe" | "FEMALE" | "Gentoo" |
254 | 6050.0 | 17.0 | 59.6 | 230.0 | "Biscoe" | "MALE" | "Gentoo" |
255 | 5150.0 | 14.8 | 49.1 | 220.0 | "Biscoe" | "FEMALE" | "Gentoo" |
256 | 5400.0 | 16.3 | 48.4 | 220.0 | "Biscoe" | "MALE" | "Gentoo" |
257 | 4950.0 | 13.7 | 42.6 | 213.0 | "Biscoe" | "FEMALE" | "Gentoo" |
258 | 5250.0 | 17.3 | 44.4 | 219.0 | "Biscoe" | "MALE" | "Gentoo" |
259 | 4350.0 | 13.6 | 44.0 | 208.0 | "Biscoe" | "FEMALE" | "Gentoo" |
260 | 5350.0 | 15.7 | 48.7 | 208.0 | "Biscoe" | "MALE" | "Gentoo" |
261 | 3950.0 | 13.7 | 42.7 | 208.0 | "Biscoe" | "FEMALE" | "Gentoo" |
262 | 5700.0 | 16.0 | 49.6 | 225.0 | "Biscoe" | "MALE" | "Gentoo" |
263 | 4300.0 | 13.7 | 45.3 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
264 | 4750.0 | 15.0 | 49.6 | 216.0 | "Biscoe" | "MALE" | "Gentoo" |
265 | 5550.0 | 15.9 | 50.5 | 222.0 | "Biscoe" | "MALE" | "Gentoo" |
266 | 4900.0 | 13.9 | 43.6 | 217.0 | "Biscoe" | "FEMALE" | "Gentoo" |
267 | 4200.0 | 13.9 | 45.5 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
268 | 5400.0 | 15.9 | 50.5 | 225.0 | "Biscoe" | "MALE" | "Gentoo" |
269 | 5100.0 | 13.3 | 44.9 | 213.0 | "Biscoe" | "FEMALE" | "Gentoo" |
270 | 5300.0 | 15.8 | 45.2 | 215.0 | "Biscoe" | "MALE" | "Gentoo" |
271 | 4850.0 | 14.2 | 46.6 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
272 | 5300.0 | 14.1 | 48.5 | 220.0 | "Biscoe" | "MALE" | "Gentoo" |
273 | 4400.0 | 14.4 | 45.1 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
274 | 5000.0 | 15.0 | 50.1 | 225.0 | "Biscoe" | "MALE" | "Gentoo" |
275 | 4900.0 | 14.4 | 46.5 | 217.0 | "Biscoe" | "FEMALE" | "Gentoo" |
276 | 5050.0 | 15.4 | 45.0 | 220.0 | "Biscoe" | "MALE" | "Gentoo" |
277 | 4300.0 | 13.9 | 43.8 | 208.0 | "Biscoe" | "FEMALE" | "Gentoo" |
278 | 5000.0 | 15.0 | 45.5 | 220.0 | "Biscoe" | "MALE" | "Gentoo" |
279 | 4450.0 | 14.5 | 43.2 | 208.0 | "Biscoe" | "FEMALE" | "Gentoo" |
280 | 5550.0 | 15.3 | 50.4 | 224.0 | "Biscoe" | "MALE" | "Gentoo" |
281 | 4200.0 | 13.8 | 45.3 | 208.0 | "Biscoe" | "FEMALE" | "Gentoo" |
282 | 5300.0 | 14.9 | 46.2 | 221.0 | "Biscoe" | "MALE" | "Gentoo" |
283 | 4400.0 | 13.9 | 45.7 | 214.0 | "Biscoe" | "FEMALE" | "Gentoo" |
284 | 5650.0 | 15.7 | 54.3 | 231.0 | "Biscoe" | "MALE" | "Gentoo" |
285 | 4700.0 | 14.2 | 45.8 | 219.0 | "Biscoe" | "FEMALE" | "Gentoo" |
286 | 5700.0 | 16.8 | 49.8 | 230.0 | "Biscoe" | "MALE" | "Gentoo" |
288 | 5800.0 | 16.2 | 49.5 | 229.0 | "Biscoe" | "MALE" | "Gentoo" |
289 | 4700.0 | 14.2 | 43.5 | 220.0 | "Biscoe" | "FEMALE" | "Gentoo" |
290 | 5550.0 | 15.0 | 50.7 | 223.0 | "Biscoe" | "MALE" | "Gentoo" |
291 | 4750.0 | 15.0 | 47.7 | 216.0 | "Biscoe" | "FEMALE" | "Gentoo" |
292 | 5000.0 | 15.6 | 46.4 | 221.0 | "Biscoe" | "MALE" | "Gentoo" |
293 | 5100.0 | 15.6 | 48.2 | 221.0 | "Biscoe" | "MALE" | "Gentoo" |
294 | 5200.0 | 14.8 | 46.5 | 217.0 | "Biscoe" | "FEMALE" | "Gentoo" |
295 | 4700.0 | 15.0 | 46.4 | 216.0 | "Biscoe" | "FEMALE" | "Gentoo" |
296 | 5800.0 | 16.0 | 48.6 | 230.0 | "Biscoe" | "MALE" | "Gentoo" |
297 | 4600.0 | 14.2 | 47.5 | 209.0 | "Biscoe" | "FEMALE" | "Gentoo" |
298 | 6000.0 | 16.3 | 51.1 | 220.0 | "Biscoe" | "MALE" | "Gentoo" |
299 | 4750.0 | 13.8 | 45.2 | 215.0 | "Biscoe" | "FEMALE" | "Gentoo" |
300 | 5950.0 | 16.4 | 45.2 | 223.0 | "Biscoe" | "MALE" | "Gentoo" |
301 | 4625.0 | 14.5 | 49.1 | 212.0 | "Biscoe" | "FEMALE" | "Gentoo" |
302 | 5450.0 | 15.6 | 52.5 | 221.0 | "Biscoe" | "MALE" | "Gentoo" |
303 | 4725.0 | 14.6 | 47.4 | 212.0 | "Biscoe" | "FEMALE" | "Gentoo" |
304 | 5350.0 | 15.9 | 50.0 | 224.0 | "Biscoe" | "MALE" | "Gentoo" |
305 | 4750.0 | 13.8 | 44.9 | 212.0 | "Biscoe" | "FEMALE" | "Gentoo" |
306 | 5600.0 | 17.3 | 50.8 | 228.0 | "Biscoe" | "MALE" | "Gentoo" |
307 | 4600.0 | 14.4 | 43.4 | 218.0 | "Biscoe" | "FEMALE" | "Gentoo" |
308 | 5300.0 | 14.2 | 51.3 | 218.0 | "Biscoe" | "MALE" | "Gentoo" |
309 | 4875.0 | 14.0 | 47.5 | 212.0 | "Biscoe" | "FEMALE" | "Gentoo" |
310 | 5550.0 | 17.0 | 52.1 | 230.0 | "Biscoe" | "MALE" | "Gentoo" |
311 | 4950.0 | 15.0 | 47.5 | 218.0 | "Biscoe" | "FEMALE" | "Gentoo" |
312 | 5400.0 | 17.1 | 52.2 | 228.0 | "Biscoe" | "MALE" | "Gentoo" |
313 | 4750.0 | 14.5 | 45.5 | 212.0 | "Biscoe" | "FEMALE" | "Gentoo" |
314 | 5650.0 | 16.1 | 49.5 | 224.0 | "Biscoe" | "MALE" | "Gentoo" |
315 | 4850.0 | 14.7 | 44.5 | 214.0 | "Biscoe" | "FEMALE" | "Gentoo" |
316 | 5200.0 | 15.7 | 50.8 | 226.0 | "Biscoe" | "MALE" | "Gentoo" |
317 | 4925.0 | 15.8 | 49.4 | 216.0 | "Biscoe" | "MALE" | "Gentoo" |
318 | 4875.0 | 14.6 | 46.9 | 222.0 | "Biscoe" | "FEMALE" | "Gentoo" |
319 | 4625.0 | 14.4 | 48.4 | 203.0 | "Biscoe" | "FEMALE" | "Gentoo" |
320 | 5250.0 | 16.5 | 51.1 | 225.0 | "Biscoe" | "MALE" | "Gentoo" |
321 | 4850.0 | 15.0 | 48.5 | 219.0 | "Biscoe" | "FEMALE" | "Gentoo" |
322 | 5600.0 | 17.0 | 55.9 | 228.0 | "Biscoe" | "MALE" | "Gentoo" |
323 | 4975.0 | 15.5 | 47.2 | 215.0 | "Biscoe" | "FEMALE" | "Gentoo" |
324 | 5500.0 | 15.0 | 49.1 | 228.0 | "Biscoe" | "MALE" | "Gentoo" |
326 | 5500.0 | 16.1 | 46.8 | 215.0 | "Biscoe" | "MALE" | "Gentoo" |
327 | 4700.0 | 14.7 | 41.7 | 210.0 | "Biscoe" | "FEMALE" | "Gentoo" |
328 | 5500.0 | 15.8 | 53.4 | 219.0 | "Biscoe" | "MALE" | "Gentoo" |
329 | 4575.0 | 14.0 | 43.3 | 208.0 | "Biscoe" | "FEMALE" | "Gentoo" |
330 | 5500.0 | 15.1 | 48.1 | 209.0 | "Biscoe" | "MALE" | "Gentoo" |
331 | 5000.0 | 15.2 | 50.5 | 216.0 | "Biscoe" | "FEMALE" | "Gentoo" |
332 | 5950.0 | 15.9 | 49.8 | 229.0 | "Biscoe" | "MALE" | "Gentoo" |
333 | 4650.0 | 15.2 | 43.5 | 213.0 | "Biscoe" | "FEMALE" | "Gentoo" |
334 | 5500.0 | 16.3 | 51.5 | 230.0 | "Biscoe" | "MALE" | "Gentoo" |
335 | 4375.0 | 14.1 | 46.2 | 217.0 | "Biscoe" | "FEMALE" | "Gentoo" |
336 | 5850.0 | 16.0 | 55.1 | 230.0 | "Biscoe" | "MALE" | "Gentoo" |
338 | 6000.0 | 16.2 | 48.8 | 222.0 | "Biscoe" | "MALE" | "Gentoo" |
339 | 4925.0 | 13.7 | 47.2 | 214.0 | "Biscoe" | "FEMALE" | "Gentoo" |
341 | 4850.0 | 14.3 | 46.8 | 215.0 | "Biscoe" | "FEMALE" | "Gentoo" |
342 | 5750.0 | 15.7 | 50.4 | 222.0 | "Biscoe" | "MALE" | "Gentoo" |
343 | 5200.0 | 14.8 | 45.2 | 212.0 | "Biscoe" | "FEMALE" | "Gentoo" |
344 | 5400.0 | 16.1 | 49.9 | 213.0 | "Biscoe" | "MALE" | "Gentoo" |
Analyzing the Data
We can easily visualize the data we just loaded in different ways. For example, let’s take a look at the distribution of male and female penguins by species:
def output = vegalite:plot[
vegalite:bar[
:species,
{ :aggregate, "count" },
{ :data, penguin; :color, :sex; }
]
]
Preparing the Data
Once we have the data loaded, we need to transform it so that we can feed it into the machine learning models.
In general, we support a variety of machine learning models. The complete list of supported models can be found in the Machine Learning Library.
Most of these models require two relations:
- one containing the features to be used as inputs to train a model, and
- one containing the response (or target) variable (the class, in our case) that we want to learn to predict.
To this end, we put the feature data in the features relation and the class data (currently read as strings) in the response_string relation.
Note that, in the current implementation of the Machine Learning Library, the relation from which we extract the features (i.e., penguin) needs to be an extensional database (EDB) relation.
This was done using insert earlier, when we defined the penguin relation.
def features = penguin[col]
for col in {
:island; :culmen_length_mm; :culmen_depth_mm;
:flipper_length_mm; :body_mass_g; :sex
}
def response_string = penguin:species
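In Python terms (an illustrative analogue, not the guide's Rel API), this step is simply a column split of the cleaned records into the six input features and the class column; the record shown is a hypothetical single row:

```python
FEATURE_COLS = ["island", "culmen_length_mm", "culmen_depth_mm",
                "flipper_length_mm", "body_mass_g", "sex"]

records = [
    {"species": "Adelie", "island": "Torgersen", "culmen_length_mm": 39.1,
     "culmen_depth_mm": 18.7, "flipper_length_mm": 181.0,
     "body_mass_g": 3750.0, "sex": "MALE"},
]

# features: everything except the class; response_string: the class column.
features = [{c: r[c] for c in FEATURE_COLS} for r in records]
response_string = [r["species"] for r in records]

print(len(features[0]))  # 6 feature columns
print(response_string)   # ['Adelie']
```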
We can easily get statistics about our features data using describe:
table[describe[features]]
Relation: output
body_mass_g | culmen_depth_mm | culmen_length_mm | flipper_length_mm | island | sex | |
---|---|---|---|---|---|---|
count | 333 | 333 | 333 | 333 | 333 | 333 |
max | 6300.0 | 21.5 | 59.6 | 231.0 | "Torgersen" | "MALE" |
mean | 4207.057057057057 | 17.16486486486487 | 43.992792792792805 | 200.96696696696696 | ||
min | 2700.0 | 13.1 | 32.1 | 172.0 | "Biscoe" | "FEMALE" |
percentile25 | 3550.0 | 15.6 | 39.5 | 190.0 | ||
percentile50 | 4050.0 | 17.3 | 44.5 | 197.0 | ||
percentile75 | 4775.0 | 18.7 | 48.6 | 213.0 | ||
std | 805.2158019428964 | 1.9692354633199007 | 5.468668342647561 | 14.015765288287879 | ||
mode | "Biscoe" | "MALE" | ||||
mode_freq | 163 | 168 | ||||
unique | 3 | 2 |
and, of course, we can do the same for our response_string data:
table[(:response, describe_full[response_string])]
Relation: output
response | |
---|---|
count | 333 |
max | "Gentoo" |
min | "Adelie" |
mode | "Adelie" |
mode_freq | 146 |
unique | 3 |
Here, we used describe_full because we have only one column in the response_string relation. In contrast to describe, describe_full provides statistics over the data as a whole, rather than per feature.
Converting Class Names to Integers
We will use an mlpack classifier, so we need to represent the response classes as integers; strings or floats cannot be used to represent the classes. To this end, we first identify all the unique classes. We can get them using last:
def classes = last[response_string]
Next, we assign a number as an ID for each class. We can do this using sort, which sorts the classes so that we can use the ordering index as the class ID:
def id_class = sort[classes]
In order to join with the response_string relation and get the IDs, we need to swap the first and second columns. We can do this using transpose:
def class_id = transpose[id_class]
Note that transpose simply swaps the first and second columns; it is not to be confused with typical matrix transposition. After we swap the columns, we can join with the response_string relation:
def response = response_string.class_id
Of course, we could have done all this in one step as follows:
def response = response_string.(transpose[sort[last[response_string]]])
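The same sort-then-index encoding can be sketched in Python (an illustrative analogue of the last/sort/transpose pipeline, not the Rel API; the sample responses are hypothetical):

```python
response_string = ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo"]

# Unique classes, sorted so the position in the ordering serves as the ID
# (1-based, matching Rel's sort).
classes = sorted(set(response_string))        # ['Adelie', 'Chinstrap', 'Gentoo']
id_class = dict(enumerate(classes, start=1))  # id -> class name

# Swap the two columns (the role transpose plays in the Rel code) to get
# class name -> ID, then join with the responses.
class_id = {name: i for i, name in id_class.items()}
response = [class_id[s] for s in response_string]
print(response)  # [1, 3, 1, 2, 3]
```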
Creating Training and Test Datasets
In classification (as well as other machine learning approaches), we use a “training” dataset to learn a classification model and a “test” dataset to determine the accuracy of our model. In certain cases, we may also use a validation dataset for parameter tuning, but we will consider only training and test for the purposes of this how-to guide.
Because the `penguin` dataset is not already split into training and test sets, we will have to create these two datasets.
In the following, we split our data into training and test sets with an 80/20 ratio. We specify the splitting ratio and the seed in `split_params`. The splitting is done by `mlpack_preprocess_split`, which splits the keys into the two sets. Afterwards, we join them with `features` and `response` to generate the corresponding training and test datasets:
def split_params = {("test_ratio", "0.2"); ("seed", "42")}
def data_key(:keys, k) = features(_, k, _)
def data_key_split = mlpack_preprocess_split[data_key, split_params]
def feature_train(f, k, v) = features(f, k, v) and data_key_split(1, k)
def feature_test(f, k, v) = features(f, k, v) and data_key_split(2, k)
def response_train(k, v) = response(k, v) and data_key_split(1, k)
def response_test(k, v) = response(k, v) and data_key_split(2, k)
The relation `split_params` specifies the exact splitting ratio between training and test sets. Note that both the parameter name and the value need to be encoded as strings.
At this point, we can also add various checks to ensure that we have included all the instances from the original data set when we did the splitting in training and test. For example, we can check that the number of instances in training and test add up:
ic all_data() {
count[feature_train] + count[feature_test] = count[features]
}
Or, we can more rigorously ensure that we have actually performed a split using all the available data:
ic all_features() {
equal(features, union[feature_train, feature_test])
}
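The splitting step can be pictured in plain terms: shuffle the keys with a fixed seed, hold out 20% for testing, and verify that the two parts add back up to the whole, which is the analogue of the `all_data` integrity check above. A hedged Python sketch of this idea (not what `mlpack_preprocess_split` does internally):

```python
import random

def train_test_split(keys, test_ratio=0.2, seed=42):
    """Deterministically shuffle the keys and split them into
    (train, test) according to test_ratio."""
    keys = list(keys)
    random.Random(seed).shuffle(keys)       # seeded, reproducible shuffle
    n_test = round(len(keys) * test_ratio)  # size of the held-out test set
    return keys[n_test:], keys[:n_test]

train, test = train_test_split(range(333), test_ratio=0.2, seed=42)
# The integrity check: no key is lost or duplicated by the split.
assert sorted(train + test) == list(range(333))
print(len(train), len(test))  # 266 67
```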
Building a Classifier
In this guide, we will be using mlpack to create a decision tree classifier. The decision tree classifier of mlpack (as well as most of the other classifiers) can accept a set of optional parameters to tune the specific algorithm. The parameters for each classifier (aka hyper-parameters) are documented in the Machine Learning Library reference.
We set the hyper-parameters through a relation (we call it `hyper_param` here), as follows:
def hyper_param = {
("minimum_leaf_size", "10");
("minimum_gain_split", "1e-07")
}
Note that each classifier has its own parameters, which you can find in the Machine Learning Library reference. The parameters currently need to be passed as strings, as in the example above. We can also pass no parameters to the classifier to use the defaults. In our example, we specified a minimum of `10` instances per leaf and a minimum gain for node splitting of `1e-07`.
At this point, we are ready to build our classifier. We will use `mlpack_decision_tree` and specify the features for learning (i.e., the `feature_train` relation), the classes to learn to predict (i.e., the `response_train` relation), and the parameters:
def classifier = mlpack_decision_tree[
feature_train,
response_train,
hyper_param
]
We now have a trained classifier in the relation `classifier`, which represents the model we have learned.
Performing Predictions
Our trained model `classifier` is now ready to make predictions. To make predictions, we use `mlpack_decision_tree_predict`, where we need to provide:
- the trained ML model,
- a relation with features similar to the one that was used for training, and
- the number of keys used in the feature relation.
The number of keys is necessary because it defines the arity of the relation with the features used to perform the predictions. In our case, we have only one key: the CSV file position, which we carried over from the data loading step.
We can predict the penguin species using the training dataset:
def prediction_train = mlpack_decision_tree_predict[
classifier,
feature_train,
1
]
We can, of course, also predict the penguin species of the unseen test dataset:
def prediction_test = mlpack_decision_tree_predict[
classifier,
feature_test,
1
]
Evaluating Our Model
We can evaluate machine learning models using a variety of metrics. One popular way is the accuracy, which is defined as the fraction of the number of correct predictions over the total number of predictions.
We can compute the accuracy of the `classifier` model on the training dataset as follows:
def train_accuracy =
count[pos : prediction_train[pos] = response_train[pos]] /
count[response_train]
Of course, what we really care about is the performance of our model on the test dataset:
def test_accuracy =
count[pos : prediction_test[pos] = response_test[pos]] /
count[response_test]
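Accuracy is simply the fraction of keys whose predicted class matches the response. The same computation in an illustrative Python sketch (not Rel; the dictionaries stand in for the keyed relations):

```python
def accuracy(predictions, responses):
    """Fraction of keys (CSV positions) where the predicted class
    matches the actual class."""
    correct = sum(1 for k in responses if predictions.get(k) == responses[k])
    return correct / len(responses)

pred = {1: "Adelie", 2: "Gentoo", 3: "Adelie", 4: "Chinstrap"}
resp = {1: "Adelie", 2: "Gentoo", 3: "Chinstrap", 4: "Chinstrap"}
print(accuracy(pred, resp))  # 0.75
```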
We can also compute precision and recall (aka sensitivity) metrics for each class:
def score_precision[c] =
count[pos : prediction_test(pos, c) and response_test(pos, c)] /
count[pos : prediction_test(pos, c)]
def score_recall[c] =
count[pos : prediction_test(pos, c) and response_test(pos, c)] /
count[pos : response_test(pos, c)]
and query them.
With precision and recall metrics at hand, we can also compute the F1 score for each class:
def score_f1[c] =
2 * score_precision[c] * score_recall[c] /
(score_precision[c] + score_recall[c])
We can then query them.
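The three per-class scores above have direct counterparts outside Rel: precision divides the true positives by the predicted count, recall divides them by the actual count, and F1 is their harmonic mean. An illustrative Python sketch of the same definitions (not Rel code):

```python
def per_class_scores(predictions, responses, c):
    """Precision, recall, and F1 for class c, given dicts mapping
    each key to a predicted / actual class label."""
    tp = sum(1 for k in responses if predictions[k] == c and responses[k] == c)
    predicted = sum(1 for k in predictions if predictions[k] == c)
    actual = sum(1 for k in responses if responses[k] == c)
    precision = tp / predicted          # of those predicted c, how many were c
    recall = tp / actual                # of the actual c, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

pred = {1: "A", 2: "A", 3: "B", 4: "B"}
resp = {1: "A", 2: "B", 3: "B", 4: "B"}
print(per_class_scores(pred, resp, "B"))  # precision 1.0, recall 2/3, F1 0.8
```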
Finally, we can compute the full confusion matrix (where `actual` is the actual class, or response, and `predicted` is the predicted class):
def confusion_matrix[predicted, actual] = count[
x : response_test(x, actual) and prediction_test(x, predicted)
]
When we query for it, we get:
Note that `count` does not return `0` for an empty relation. This means that if no data record of class `actual` was predicted to be of class `predicted`, then this pair does not appear in `confusion_matrix`.
This reflects the fundamental principle that, in Rel, missing data (or NULL in SQL) is not explicitly stored or represented.
To assign a zero count to these missing values, we simply need to explicitly define that for any missing predicted-actual class pair, `(predicted, actual)`, we want to assign a count of `0`. This is done below with the `left_override` (`<++`) operator:
table[
confusion_matrix[class_column.class_id, class_row.class_id] <++ 0
for class_column in classes,
class_row in classes
]
Relation: output
 | Adelie | Chinstrap | Gentoo |
---|---|---|---|
Adelie | 27 | 1 | 0 |
Chinstrap | 1 | 9 | 0 |
Gentoo | 0 | 1 | 27 |
Here we also convert the integer class IDs back to their original class names and display the relation as a wide table.
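The zero-fill idiom corresponds to counting the observed `(predicted, actual)` pairs and then defaulting every unobserved pair to `0`, which is exactly what the `<++` left-override expresses in Rel. An illustrative Python sketch (not Rel code):

```python
def confusion_matrix(predictions, responses, classes):
    """Count (predicted, actual) pairs; pairs that never occur default
    to 0, mirroring Rel's `<++ 0` left-override."""
    counts = {}
    for k in responses:
        pair = (predictions[k], responses[k])
        counts[pair] = counts.get(pair, 0) + 1
    # Fill in every possible pair, defaulting the missing ones to 0.
    return {(p, a): counts.get((p, a), 0) for p in classes for a in classes}

pred = {1: "Adelie", 2: "Gentoo", 3: "Adelie"}
resp = {1: "Adelie", 2: "Gentoo", 3: "Gentoo"}
cm = confusion_matrix(pred, resp, ["Adelie", "Chinstrap", "Gentoo"])
print(cm[("Adelie", "Gentoo")], cm[("Chinstrap", "Chinstrap")])  # 1 0
```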
Training Multiple Classifiers
With Rel we can easily train and test multiple classifiers. Let’s consider an example.
We will train a set of classifiers on the same train and test datasets as before, but we will use a different set of hyper-parameters for each classifier.
We will use a relation called `hyper_param` within a module called `fine_tune` to keep all the different hyper-parameter configurations:
module fine_tune
def hyper_param = {
("Classifier 1", {("minimum_leaf_size", "10"); ("minimum_gain_split", "1e-07")});
("Classifier 2", {("minimum_leaf_size", "20"); ("maximum_depth", "3")});
("Classifier 3", {("minimum_leaf_size", "5"); ("maximum_depth", "0")});
}
end
In the `hyper_param` relation, we use a string key (i.e., `"Classifier 1"`, `"Classifier 2"`, `"Classifier 3"`) to identify each hyper-parameter configuration. This key will also be useful to identify the classifier trained from each configuration.
We can now train multiple classifiers easily as follows:
module fine_tune
def classifier[i] = mlpack_decision_tree[
feature_train,
response_train,
hyper_param[i]
]
end
Note that the call to `mlpack_decision_tree` is the same as before, except that we iterate over all the hyper-parameter configurations in `hyper_param`.
We can now create predictions for each of the trained classifiers on the test set:
module fine_tune
def prediction_test[i] = mlpack_decision_tree_predict[
fine_tune:classifier[i],
feature_test,
1
]
end
And, as the next step, we can compute the precision for each classifier:
module fine_tune
def score_precision(i, cl, score) =
c = count[ pos :
fine_tune:prediction_test(i, pos, id) and
response_test(pos, id)
]
and n = count[pos : fine_tune:prediction_test(i, pos, id)]
and score = c/n
and id = class_id[cl]
from c, n, id
end
def output = table[fine_tune:score_precision]
Relation: output
 | Classifier 1 | Classifier 2 | Classifier 3 |
---|---|---|---|
Adelie | 0.9642857142857143 | 0.9642857142857143 | 0.9333333333333333 |
Chinstrap | 0.8181818181818182 | 0.8181818181818182 | 1.0 |
Gentoo | 1.0 | 1.0 | 1.0 |
Finally, we can plot the performance of each classifier over some specific metric. For example, we can show the precision of each classifier for each of the three classes as follows:
def precision_plot_data[:[], i] = {
(:classifier_id, cid);
(:class, cl);
(:precision, pr)
}
from cid, cl, pr where sort[fine_tune:score_precision](i, cid, cl, pr)
def chart:data:values = precision_plot_data
def chart:mark = "bar"
def chart:width = 300
def chart = vegalite_utils:x[{
(:field, "class");
}]
def chart = vegalite_utils:y[{
(:field, "precision");
(:type, "quantitative");
(:axis, :format, ".3f");
}]
def chart:encoding:xOffset = { (:field, "classifier_id"); (:type, "nominal");}
def chart:encoding:color:field = "classifier_id"
def output = vegalite:plot[chart]
Based on this analysis of the performance of multiple classifiers, we can easily determine which one is expected to perform best. As an example, let’s pick the classifier with the maximum precision on the test set over all classes:
def score_precision_overall[i] =
count[pos : fine_tune:prediction_test[i, pos] = response_test[pos]] /
count[fine_tune:prediction_test[i]]
def max_precision_classifier_id = argmax[score_precision_overall]
def max_precision = score_precision_overall[max_precision_classifier_id]
def output:classifier = max_precision_classifier_id
def output:precision = max_precision
Relation: output
:classifier | "Classifier 3" |
:precision | 0.9696969696969697 |
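The final selection is just an argmax over the per-classifier scores. In illustrative Python terms (the score values here are hypothetical stand-ins, except for Classifier 3, whose overall precision appears above):

```python
# Hypothetical per-classifier overall precision scores; only the
# Classifier 3 value is taken from the output above.
scores = {
    "Classifier 1": 0.9393939393939394,
    "Classifier 2": 0.9393939393939394,
    "Classifier 3": 0.9696969696969697,
}

best = max(scores, key=scores.get)  # argmax over classifier ids
print(best, scores[best])           # Classifier 3 0.9696969696969697
```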
Summary
We demonstrated the use of a decision tree classifier on our penguin dataset. More specifically, we used `mlpack_decision_tree`, i.e., a decision tree classifier from mlpack. We support other classifiers as well: in addition to mlpack, we also support other machine learning libraries, such as glm and xgboost, with more coming.
It is important to note here that all of our machine learning models are designed to have the same API. In this way, we can easily swap machine learning models of a similar type (i.e., classification models). In the example in this guide, we can simply switch `mlpack_decision_tree` with `mlpack_random_forest`, change `hyper_param` to the right parameters for `mlpack_random_forest` (or just leave it empty to use the defaults), and we then have a random forest classifier.
In addition to the machine learning models, the Machine Learning Library also has useful functionality for other tasks. For example, we can perform k-nearest-neighbor search on a relation through `mlpack_knn`, or perform dimensionality reduction through kernel principal component analysis (KPCA) on a given dataset through `mlpack_kernel_pca`.
For a complete list of machine learning models and related functionality, see the Machine Learning Library.