# The Machine Learning Library (ml)

Machine learning bindings for mlpack, GLM.jl, and XGBoost.

## glm_generic

```
glm_generic[F, R, H]
```

A binding of the GLM.jl function `glm`. Fits a generalized linear model given features `F`, responses `R`, and a family and link passed in the hyperparameters `H`. The supported families and links are listed below.

Input options:

- `F`: Relation of features to perform a GLM regression on.
- `R`: Relation of responses to train the GLM regression model on.
- `H`: Relation of hyperparameters specifying the family and link to use to generate the generalized linear model. Example: `H = {("family","Normal"); ("link","IdentityLink")}`. Families supported: `["Bernoulli", "Binomial", "Gamma", "InverseGaussian", "NegativeBinomial", "Normal", "Poisson"]`. Links supported: `["CauchitLink", "CloglogLink", "IdentityLink", "InverseLink", "InverseSquareLink", "LogitLink", "LogLink", "ProbitLink", "SqrtLink"]`.

Result:

- A GLM model that can later be used with `glm_predict[]`.

Example:

```
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def hyperparams = {("family", "NegativeBinomial"); ("link", "LogLink")}
def model = glm_generic[features, responses, hyperparams]
```

#### Definition

```
@inline def glm_generic[F, R, H] = ext_ml_train[:glm_generic, F, R, H]
```

## glm_linear_regression

```
glm_linear_regression[F, R]
```

A binding of the GLM.jl function `lm`. Fits a linear regression model given features `F` and responses `R`.

Note that this is unregularized linear regression, so if your model does not converge (e.g. training gives a `PosDefException`), try using regularized linear regression, perhaps via `mlpack_linear_regression[]` with the `lambda` hyperparameter set, or ensure that the columns of your data are not linearly dependent.

Input options:

- `F`: Relation of features to perform linear regression on.
- `R`: Relation of responses to train the linear regression model on.

Result:

- A GLM model that can later be used with `glm_predict`.

Example:

```
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_linear_regression[features, responses]
```

#### Definition

```
@inline def glm_linear_regression[F, R] = ext_ml_train[:glm_generic, F, R,
    { a, b : (a = "family" and b = "Normal") or (a = "link" and b = "IdentityLink")}
]
```

## glm_logistic_regression

```
glm_logistic_regression[F, R]
```

A binding of the GLM.jl function `glm` with the Binomial family and Logit link. Fits a logistic regression model given features `F` and responses `R`.

Input options:

- `F`: Relation of features to perform logistic regression on.
- `R`: Relation of responses to train the logistic regression model on.

Result:

- A GLM model that can later be used with `glm_predict`.

Example:

```
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_logistic_regression[features, responses]
```

#### Definition

```
@inline def glm_logistic_regression[F, R] = ext_ml_train[:glm_generic, F, R,
    { a, b : (a = "family" and b = "Binomial") or (a = "link" and b = "LogitLink")}
]
```

## glm_predict

```
glm_predict[M, F, N]
```

A binding of the GLM.jl function `predict`. Uses a generalized linear model `M` to generate predictions for features `F`. Here, `M` can be produced by any of the definitions `glm_linear_regression`, `glm_logistic_regression`, `glm_probit_regression`, or `glm_generic`.

Input options:

- `M`: Relation containing the model generated by previously running one of the generalized linear model natives (e.g. `glm_linear_regression` or `glm_generic`).
- `F`: Relation of features to generate predictions for, using the previously computed model.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when computing predictions).

Result:

- Predictions for the features `F` after being fit with the model `M`.

Example:

```
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_probit_regression[features, responses]
def predictions = glm_predict[model, features, 1]
```

#### Definition

```
@inline def glm_predict[M, F, N] = ext_ml_predict[:glm_predict, M, F, N]
```

## glm_probit_regression

```
glm_probit_regression[F, R]
```

A binding of the GLM.jl function `glm` with the Binomial family and Probit link. Fits a probit regression model given features `F` and responses `R`.

Input options:

- `F`: Relation of features to perform probit regression on.
- `R`: Relation of responses to train the probit regression model on.

Result:

- A GLM model that can later be used with `glm_predict`.

Example:

```
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_probit_regression[features, responses]
```

#### Definition

```
@inline def glm_probit_regression[F, R] = ext_ml_train[:glm_generic, F, R,
    { a, b : (a = "family" and b = "Binomial") or (a = "link" and b = "ProbitLink")}
]
```

## mlpack_adaboost

```
mlpack_adaboost[F, R, H]
```

An implementation of the AdaBoost.MH (Adaptive Boosting) algorithm for classification. This can be used to train an AdaBoost model on labeled data.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on
- `R`: relation of responses; the last variable should be the response; everything else should be keys
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `iterations` (`Int`): The maximum number of boosting iterations to be run (0 will run until convergence). Default `1000`.
- `tolerance` (`Float64`): The tolerance for change in values of the weighted error during training. Default `1e-10`.
- `verbose` (`Bool`): Display informational messages and the full list of parameters and timers at the end of execution.
- `weak_learner` (`String`): The type of weak learner to use: `decision_stump` or `perceptron`. Default `decision_stump`.
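
Example:

A hypothetical usage sketch in the style of the GLM examples above (the feature values and hyperparameter choices are illustrative only):

```
def features = {(1, 1.0, 2.0); (2, 2.0, 1.0); (3, 3.0, 4.0); (4, 4.0, 3.0)}
def responses = {(1, 0); (2, 0); (3, 1); (4, 1)}
def hyperparams = {("weak_learner", "perceptron"); ("iterations", "100")}
def model = mlpack_adaboost[features, responses, hyperparams]
```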

#### Definition

```
@inline def mlpack_adaboost[F, R, H] = ext_ml_train[:mlpack_adaboost, F, R, H]
```

## mlpack_adaboost_predict

```
mlpack_adaboost_predict[M, F, N]
```

Given an AdaBoost.MH model trained with `mlpack_adaboost[]`, make class predictions on a test set.

See also the mlpack documentation and the documentation for `mlpack_adaboost[]` for more details.

Inputs:

- `M`: AdaBoost model to use for prediction; must be the result of a previous `mlpack_adaboost[]` call
- `F`: relation of test features for which class predictions will be computed
- `N`: constant Int representing the number of keys in `F`
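
Example:

A hypothetical usage sketch (the data values and hyperparameter choices are illustrative only; the model is assumed to come from a prior `mlpack_adaboost[]` call):

```
def features = {(1, 1.0, 2.0); (2, 2.0, 1.0); (3, 3.0, 4.0); (4, 4.0, 3.0)}
def responses = {(1, 0); (2, 0); (3, 1); (4, 1)}
def model = mlpack_adaboost[features, responses, {("weak_learner", "decision_stump")}]
def predictions = mlpack_adaboost_predict[model, features, 1]
```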

#### Definition

```
@inline def mlpack_adaboost_predict[M, F, N] =
    ext_ml_predict[:mlpack_adaboost_predict, M, F, N]
```

## mlpack_approx_kfn

```
mlpack_approx_kfn[K, M, Q, N, H]
```

Perform approximate k-furthest-neighbor search on a relation `Q` containing query points, using a model `M` that was built with `mlpack_approx_kfn_build[]`.

See also the mlpack documentation for more details.

Inputs:

- `K`: constant representing the number of furthest neighbors to search for.
- `M`: pre-trained model for approximate KFN; must be the result of a previous `mlpack_approx_kfn_build[]` call.
- `Q`: relation of query points; must have the same number of keys as the relation that `M` was built with.
- `N`: constant indicating the number of arguments in `Q` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `calculate_error` (`Bool`): If set, calculate and display the average distance error for the first furthest neighbor only.
- `verbose` (`Bool`): Display informational messages.

Result:

- A relation mapping keys from `Q` to keys in the reference set that the model `M` was built on. The form is `(query_keys..., k, reference_keys..., distance)`, where `k` takes values between `1` and `K` for each possible set of `query_keys...`. Given `query_keys...` and `k`, `reference_keys...` is the set of keys associated with the `k`th approximate furthest neighbor, and `distance` is the Euclidean distance between the point associated with `query_keys...` and the point associated with `reference_keys...`.
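
Example:

A hypothetical usage sketch (the point values and hyperparameter choices are illustrative only), building a model and then querying it for the 2 approximate furthest neighbors of each point:

```
def points = {(1, 0.0, 0.0); (2, 1.0, 0.0); (3, 0.0, 1.0); (4, 5.0, 5.0)}
def model = mlpack_approx_kfn_build[points, 1, {("algorithm", "ds")}]
def furthest = mlpack_approx_kfn[2, model, points, 1, {("calculate_error", "false")}]
```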

#### Definition

```
@inline def mlpack_approx_kfn[K, M, Q, N, H] =
    ext_ml_transform[:mlpack_approx_kfn, K, M, Q, N, H]
```

## mlpack_approx_kfn_build

```
mlpack_approx_kfn_build[R, N, H]
```

An implementation of two strategies for furthest neighbor search. This creates a furthest neighbor search model that can be reused later.

See also the mlpack documentation and the documentation for `mlpack_approx_kfn[]` for more details.

Inputs:

- `R`: relation of reference points that the tree should be built on
- `N`: constant indicating the number of arguments in `R` that correspond to keys (i.e. dimensions that should not be considered when building the model).
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `algorithm` (`String`): Algorithm to use: `"ds"` or `"qdafn"`. Default `"ds"`.
- `num_projections` (`Int`): Number of projections to use in each hash table. Default `5`.
- `num_tables` (`Int`): Number of hash tables to use. Default `5`.
- `verbose` (`Bool`): Display informational messages.

Result:

- An approximate KFN model that can be used in a later call to `mlpack_approx_kfn[]`.

#### Definition

```
@inline def mlpack_approx_kfn_build[R, N, H] = ext_ml_build[:mlpack_approx_kfn, R, N, H]
```

## mlpack_dbscan

```
mlpack_dbscan[F, N, H]
```

A clustering of the dataset `F` using DBSCAN clustering with parameters `N` and `H`.

See the mlpack documentation for more details.

Inputs:

- `F`: relation of data points to cluster.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `epsilon` (`Float64`): Radius of each range search. Default `1`.
- `min_size` (`Int`): Minimum number of points for a cluster. Default `5`.
- `naive` (`Bool`): If set, brute-force range search (not tree-based) will be used. Default `false`.
- `selection_type` (`String`): If using point selection policy, the type of selection to use (`"ordered"`, `"random"`). Default `"ordered"`.
- `single_mode` (`Bool`): If set, single-tree range search (not dual-tree) will be used. Default `false`.
- `tree_type` (`String`): If using single-tree or dual-tree search, the type of tree to use (`"kd"`, `"r"`, `"r-star"`, `"x"`, `"hilbert-r"`, `"r-plus"`, `"r-plus-plus"`, `"cover"`, `"ball"`). Default `"kd"`.
- `verbose` (`Bool`): Display informational messages.

Result:

- A relation containing the keys in `F`, with a cluster assignment (`Int`) as the last argument. If the point is considered "noise" (i.e. not part of any cluster), the cluster assignment is 0.
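
Example:

A hypothetical usage sketch (the point values and hyperparameter choices are illustrative only); the first three points should form one dense cluster and the last should be noise:

```
def points = {(1, 1.0, 1.0); (2, 1.1, 0.9); (3, 0.9, 1.1); (4, 8.0, 8.0)}
def clusters = mlpack_dbscan[points, 1, {("epsilon", "0.5"); ("min_size", "3")}]
```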

#### Definition

```
@inline def mlpack_dbscan[F, N, H] = ext_ml_transform[:mlpack_dbscan, 0, {()}, F, N, H]
```

## mlpack_decision_tree

```
mlpack_decision_tree[F, R, H]
```

An implementation of an ID3-style decision tree for classification, which supports categorical data. This binding accepts categorical features in `F`; a feature in `F` is interpreted as categorical if it is an entity or has `String` type.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on
- `R`: relation of responses; the last variable should be the response; everything else should be keys
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `maximum_depth` (`Int`): Maximum depth of the tree (0 means no limit). Default `0`.
- `minimum_gain_split` (`Float64`): Minimum gain for node splitting. Default `1e-7`.
- `minimum_leaf_size` (`Int`): Minimum number of points in a leaf. Default `20`.
- `print_training_accuracy` (`Bool`): Print the training accuracy. Default `false`.
- `verbose` (`Bool`): Display informational messages and the full list of parameters and timers at the end of execution.
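
Example:

A hypothetical usage sketch (the data values and hyperparameter choices are illustrative only); the `String` feature in the last column is treated as categorical:

```
def features = {(1, 1.0, "red"); (2, 2.0, "blue"); (3, 3.0, "red"); (4, 4.0, "blue")}
def responses = {(1, 0); (2, 0); (3, 1); (4, 1)}
def model = mlpack_decision_tree[features, responses, {("maximum_depth", "4")}]
```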

#### Definition

```
@inline def mlpack_decision_tree[F, R, H] = ext_ml_train[:mlpack_decision_tree, F, R, H]
```

## mlpack_decision_tree_predict

```
mlpack_decision_tree_predict[M, F, N]
```

Given a decision tree model trained with `mlpack_decision_tree[]`, make class predictions on a test set.

See also the mlpack documentation and the documentation for `mlpack_decision_tree[]` for more details.

Inputs:

- `M`: decision tree model to use for prediction; must be the result of a previous `mlpack_decision_tree[]` call
- `F`: relation of test features for which class predictions will be computed
- `N`: constant Int representing the number of keys in `F`

#### Definition

```
@inline def mlpack_decision_tree_predict[M, F, N] =
    ext_ml_predict[:mlpack_decision_tree_predict, M, F, N]
```

## mlpack_det

```
mlpack_det[M, F, N, H]
```

Given a DET trained with `mlpack_det_build[]`, compute densities of the query points in the relation `F`.

See also the mlpack documentation and the documentation for `mlpack_det_build[]` for more details.

Inputs:

- `M`: pre-trained DET model; must be the result of a previous `mlpack_det_build[]` call.
- `F`: relation of features to compute density estimates for.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation mapping keys from `F` (i.e. the first `N` elements of the tuples in `F`) to their density estimates.
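
Example:

A hypothetical usage sketch (the point values and hyperparameter choices are illustrative only), building a DET and then estimating the density of the training points themselves:

```
def points = {(1, 1.0); (2, 1.5); (3, 2.0); (4, 9.0)}
def model = mlpack_det_build[points, 1, {("folds", "5")}]
def densities = mlpack_det[model, points, 1, {("verbose", "false")}]
```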

#### Definition

```
@inline def mlpack_det[M, F, N, H] = ext_ml_transform[:mlpack_det, 0, M, F, N, H]
```

## mlpack_det_build

```
mlpack_det_build[F, N, H]
```

An implementation of density estimation trees for the density estimation task. Density estimation trees can be trained with this native.

See also the mlpack documentation and the documentation for `mlpack_det[]` for more details.

Inputs:

- `F`: relation of features to build the density estimation tree on.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `folds` (`Int`): The number of folds of cross-validation to perform for the estimation (0 is LOOCV). Default `10`.
- `max_leaf_size` (`Int`): The maximum size of a leaf in the unpruned, fully grown DET. Default `10`.
- `min_leaf_size` (`Int`): The minimum size of a leaf in the unpruned, fully grown DET. Default `5`.
- `skip_pruning` (`Bool`): Whether to bypass the pruning process and output the unpruned tree only. Default `false`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

#### Definition

```
@inline def mlpack_det_build[F, N, H] = ext_ml_build[:mlpack_det, F, N, H]
```

## mlpack_emst

```
mlpack_emst[F, N, H]
```

An implementation of the Dual-Tree Boruvka algorithm for computing the Euclidean minimum spanning tree of a set of input points.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of data points to compute the minimum spanning tree of.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `leaf_size` (`Int`): Leaf size in the kd-tree. One-element leaves give the empirically best performance, but at the cost of greater memory requirements. Default `1`.
- `naive` (`Bool`): Compute the MST using the O(n^2) naive algorithm. Default `false`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- An ordered edge relation with weights. Specifically, each point in `F` is associated with a set of `N` keys. The first argument of the output relation is the index of the edge (starting from 1); lower-weighted edges have lower indices. The next `N` arguments of the output relation correspond to the first vertex; the following `N` arguments correspond to the second vertex; and the last argument represents the distance between those two vertices.
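
Example:

A hypothetical usage sketch (the point values are illustrative only); the output relation is of the form `(edge_index, key1, key2, distance)`:

```
def points = {(1, 0.0, 0.0); (2, 1.0, 0.0); (3, 5.0, 5.0)}
def mst = mlpack_emst[points, 1, {("leaf_size", "1")}]
```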

#### Definition

```
@inline def mlpack_emst[F, N, H] = ext_ml_transform[:mlpack_emst, 0, {()}, F, N, H]
```

## mlpack_fastmks

```
mlpack_fastmks[K, M, Q, N, H]
```

Perform max-kernel search on a relation `Q` containing query points, using a model `M` that was built with `mlpack_fastmks_build[]`.

See also the mlpack documentation and the documentation for `mlpack_fastmks_build[]` for more details.

Inputs:

- `K`: constant representing the number of max kernels to search for.
- `M`: pre-trained FastMKS model; must be the result of a previous `mlpack_fastmks_build[]` call.
- `Q`: relation of query points; must have the same number of keys as the relation that `M` was built with.
- `N`: constant indicating the number of arguments in `Q` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `verbose` (`Bool`): Display informational messages.

Result:

- A relation mapping keys from `Q` to keys in the reference set that the model `M` was built on. The form is `(query_keys..., k, reference_keys..., kernel)`, where `k` takes values between `1` and `K` for each possible set of `query_keys...`. Given `query_keys...` and `k`, `reference_keys...` is the set of keys associated with the `k`th max-kernel, and `kernel` is the kernel value between the point associated with `query_keys...` and the point associated with `reference_keys...`.

#### Definition

```
@inline def mlpack_fastmks[K, M, Q, N, H] = ext_ml_transform[:mlpack_fastmks, K, M, Q, N, H]
```

## mlpack_fastmks_build

```
mlpack_fastmks_build[R, N, H]
```

An implementation of max-kernel search using single-tree and dual-tree algorithms. Given a set of reference points and query points, this can build trees that can be used in later calls to `mlpack_fastmks[]`.

See also the mlpack documentation and the documentation for `mlpack_fastmks[]` for more details.

Inputs:

- `R`: relation of reference points that the tree should be built on
- `N`: constant indicating the number of arguments in `R` that correspond to keys (i.e. dimensions that should not be considered when building the model).
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `bandwidth` (`Float64`): Bandwidth (for Gaussian, Epanechnikov, and triangular kernels). Default `1`.
- `base` (`Float64`): Base to use during cover tree construction. Default `2`.
- `degree` (`Float64`): Degree of polynomial kernel. Default `2`.
- `kernel` (`String`): Kernel type to use: `"linear"`, `"polynomial"`, `"cosine"`, `"gaussian"`, `"epanechnikov"`, `"triangular"`, `"hyptan"`. Default `"linear"`.
- `naive` (`Bool`): If true, O(n^2) naive mode is used for computation. Default `false`.
- `offset` (`Float64`): Offset of kernel (for polynomial and hyptan kernels). Default `0`.
- `scale` (`Float64`): Scale of kernel (for hyptan kernel). Default `1`.
- `single` (`Bool`): If true, single-tree search is used (as opposed to dual-tree search). Default `false`.
- `verbose` (`Bool`): Display informational messages.

Result:

- A FastMKS model that can be used in a later call to `mlpack_fastmks[]`.

#### Definition

```
@inline def mlpack_fastmks_build[R, N, H] = ext_ml_build[:mlpack_fastmks, R, N, H]
```

## mlpack_gmm_generate

```
mlpack_gmm_generate[S, M, D, H]
```

A sample generator for pre-trained GMMs. Given a pre-trained GMM, this can sample new points randomly from that distribution.

See also the mlpack documentation for more details.

Inputs:

- `S`: constant indicating the number of samples to generate.
- `M`: pre-trained GMM from `mlpack_gmm_train[]`.
- `D`: constant representing the dimensionality of the model (i.e. the dimensionality of `F` in the call to `mlpack_gmm_train[]`).
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `seed` (`Int`): Random seed. If `0`, `std::time(NULL)` is used. Default `0`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation containing `S` samples from the given GMM `M`. The first argument is the key (an integer between `1` and `S`) and the rest of the arguments are each of the features.
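
Example:

A hypothetical usage sketch (the point values and hyperparameter choices are illustrative only), training a two-Gaussian GMM on two-dimensional points and then sampling 100 new points from it:

```
def points = {(1, 1.0, 2.0); (2, 1.1, 2.1); (3, 5.0, 6.0); (4, 5.1, 6.2)}
def model = mlpack_gmm_train[points, 1, {("gaussians", "2")}]
def samples = mlpack_gmm_generate[100, model, 2, {("seed", "1")}]
```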

#### Definition

```
@inline def mlpack_gmm_generate[S, M, D, H] =
    ext_ml_transform[:mlpack_gmm_generate, S, M, {()}, D, H]
```

## mlpack_gmm_probability

```
mlpack_gmm_probability[M, F, N, H]
```

A probability calculator for GMMs. Given a pre-trained GMM and a set of points, this can compute the probability that each point is from the given GMM.

See also the mlpack documentation for more details.

Inputs:

- `M`: pre-trained GMM from `mlpack_gmm_train[]`.
- `F`: relation of data points to compute the probabilities of.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation containing the keys of `F` (that is, the first `N` arguments), mapping to the probability that each of those samples arose from the GMM `M`.

#### Definition

```
@inline def mlpack_gmm_probability[M, F, N, H] =
    ext_ml_transform[:mlpack_gmm_probability, 0, M, F, N, H]
```

## mlpack_gmm_train

```
mlpack_gmm_train[F, N, H]
```

An implementation of the EM algorithm for training Gaussian mixture models (GMMs). Given a dataset, this can train a GMM for future use with other tools.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of data points that the model should be built on
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when building the model).
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `diagonal_covariance` (`Bool`): Force the covariance of the Gaussians to be diagonal. This can accelerate training time significantly. Default `false`.
- `gaussians` (`Int`): Number of Gaussians in the GMM. *Required*.
- `kmeans_max_iterations` (`Int`): Maximum number of iterations for the k-means algorithm (used to initialize EM). Default `1000`.
- `max_iterations` (`Int`): Maximum number of iterations of the EM algorithm (passing 0 will run until convergence). Default `250`.
- `no_force_positive` (`Bool`): Do not force the covariance matrices to be positive definite. Default `false`.
- `noise` (`Float64`): Variance of zero-mean Gaussian noise to add to data. Default `0`.
- `percentage` (`Float64`): If using `refined_start`, specify the percentage of the dataset used for each sampling (should be between 0.0 and 1.0). Default `0.02`.
- `refined_start` (`Bool`): During the initialization, use refined initial positions for k-means clustering (Bradley and Fayyad, 1998). Default `false`.
- `samplings` (`Int`): If using `refined_start`, specify the number of samplings used for initial points. Default `100`.
- `seed` (`Int`): Random seed. If `0`, `std::time(NULL)` is used. Default `0`.
- `tolerance` (`Float64`): Tolerance for convergence of EM. Default `1e-10`.
- `trials` (`Int`): Number of trials to perform in training the GMM. Default `1`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

#### Definition

```
@inline def mlpack_gmm_train[F, N, H] = ext_ml_build[:mlpack_gmm_train, F, N, H]
```

## mlpack_hoeffding_tree

```
mlpack_hoeffding_tree[F, R, H]
```

An implementation of Hoeffding trees, a form of streaming decision tree for classification. Given labeled data, a Hoeffding tree can be trained. This binding accepts categorical features in `F`; a feature in `F` is interpreted as categorical if it is an entity or has `String` type.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on
- `R`: relation of responses; the last variable should be the response; everything else should be keys
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `batch_mode` (`Bool`): If true, samples will be considered in batch instead of as a stream. This generally results in better trees but at the cost of memory usage and runtime.
- `bins` (`Int`): If the `domingos` split strategy is used, this specifies the number of bins for each numeric split. Default `10`.
- `confidence` (`Float64`): Confidence before splitting (between 0 and 1). Default `0.95`.
- `info_gain` (`Bool`): If set, information gain is used instead of Gini impurity for calculating Hoeffding bounds.
- `max_samples` (`Int`): Maximum number of samples before splitting. Default `5000`.
- `min_samples` (`Int`): Minimum number of samples before splitting. Default `100`.
- `numeric_split_strategy` (`String`): The splitting strategy to use for numeric features: `domingos` or `binary`. Default `binary`.
- `observations_before_binning` (`Int`): If the `domingos` split strategy is used, this specifies the number of samples observed before binning is performed.
- `passes` (`Int`): Number of passes to take over the dataset. Default `1`.
- `verbose` (`Bool`): Display informational messages and the full list of parameters and timers at the end of execution.

#### Definition

```
@inline def mlpack_hoeffding_tree[F, R, H] = ext_ml_train[:mlpack_hoeffding_tree, F, R, H]
```

## mlpack_hoeffding_tree_predict

```
mlpack_hoeffding_tree_predict[M, F, N]
```

Given a Hoeffding tree model trained with `mlpack_hoeffding_tree[]`, make class predictions on a test set.

See also the mlpack documentation and the documentation for `mlpack_hoeffding_tree[]` for more details.

Inputs:

- `M`: Hoeffding tree model to use for prediction; must be the result of a previous `mlpack_hoeffding_tree[]` call
- `F`: relation of test features for which class predictions will be computed
- `N`: constant Int representing the number of keys in `F`

#### Definition

```
@inline def mlpack_hoeffding_tree_predict[M, F, N] =
    ext_ml_predict[:mlpack_hoeffding_tree_predict, M, F, N]
```

## mlpack_kernel_pca

```
mlpack_kernel_pca[D, F, N, H]
```

An implementation of Kernel Principal Components Analysis (KPCA). This can be used to perform nonlinear dimensionality reduction or preprocessing on a given dataset.

See also the mlpack documentation for more details.

Input options:

- `D`: constant indicating the desired new dimensionality of the data.
- `F`: relation of features to perform kernel PCA on.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `bandwidth` (`Float64`): Bandwidth, for `"gaussian"` and `"laplacian"` kernels. Default `1`.
- `center` (`Bool`): If set, the transformed data will be centered about the origin. Default `false`.
- `degree` (`Float64`): Degree of polynomial, for the `"polynomial"` kernel. Default `1`.
- `kernel` (`String`): The kernel to use; see the linked documentation for the list of usable kernels. Default `"gaussian"`.
- `kernel_scale` (`Float64`): Scale, for the `"hyptan"` kernel. Default `1`.
- `nystroem_method` (`Bool`): If set, the Nystroem method will be used. Default `false`.
- `offset` (`Float64`): Offset, for `"hyptan"` and `"polynomial"` kernels. Default `0`.
- `sampling` (`String`): Sampling scheme to use for the Nystroem method: `"kmeans"`, `"random"`, `"ordered"`. Default `"kmeans"`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation mapping keys in `F` (i.e. the first `N` arguments of `F`) to `D` values in each dimension.
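
Example:

A hypothetical usage sketch (the feature values and hyperparameter choices are illustrative only), reducing two-dimensional points to one dimension:

```
def features = {(1, 1.0, 2.0); (2, 2.0, 4.1); (3, 3.0, 6.2); (4, 4.0, 7.9)}
def transformed = mlpack_kernel_pca[1, features, 1, {("kernel", "gaussian")}]
```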

#### Definition

```
@inline def mlpack_kernel_pca[D, F, N, H] =
    ext_ml_transform[:mlpack_kernel_pca, D, {()}, F, N, H]
```

## mlpack_kfn

```
mlpack_kfn[K, M, Q, N, H]
```

Perform k-furthest-neighbor search on a relation `Q` containing query points, using a model `M` that was built with `mlpack_kfn_build[]`.

See also the mlpack documentation and the documentation for `mlpack_kfn_build[]` for more details.

Inputs:

- `K`: constant representing the number of furthest neighbors to search for.
- `M`: pre-trained KFN model; must be the result of a previous `mlpack_kfn_build[]` call.
- `Q`: relation of query points; must have the same number of keys as the relation that `M` was built with.
- `N`: constant indicating the number of arguments in `Q` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `algorithm` (`String`): Type of neighbor search: `"naive"`, `"single_tree"`, `"dual_tree"`, `"greedy"`. Default `"dual_tree"`.
- `epsilon` (`Float64`): If specified, will do approximate furthest neighbor search with the given relative error. Default `0`.
- `percentage` (`Float64`): If specified, will do approximate furthest neighbor search. Must be in the range `(0,1]` (decimal form). Resultant neighbors will be at least `(p*100)%` of the distance of the true furthest neighbor. Default `1`.
- `verbose` (`Bool`): Display informational messages.

Result:

- A relation mapping keys from `Q` to keys in the reference set that the model `M` was built on. The form is `(query_keys..., k, reference_keys..., distance)`, where `k` takes values between `1` and `K` for each possible set of `query_keys...`. Given `query_keys...` and `k`, `reference_keys...` is the set of keys associated with the `k`th furthest neighbor, and `distance` is the Euclidean distance between the point associated with `query_keys...` and the point associated with `reference_keys...`.
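
Example:

A hypothetical usage sketch (the point values and hyperparameter choices are illustrative only), building a KFN model and querying it for the 2 furthest neighbors of each point:

```
def points = {(1, 0.0, 0.0); (2, 1.0, 0.0); (3, 0.0, 1.0); (4, 5.0, 5.0)}
def model = mlpack_kfn_build[points, 1, {("tree_type", "kd")}]
def furthest = mlpack_kfn[2, model, points, 1, {("algorithm", "dual_tree")}]
```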

#### Definition

```
@inline def mlpack_kfn[K, M, Q, N, H] = ext_ml_transform[:mlpack_kfn, K, M, Q, N, H]
```

## mlpack_kfn_build

```
mlpack_kfn_build[R, N, H]
```

An implementation of k-furthest-neighbor search using single-tree and dual-tree algorithms. This can build a tree that can be saved for future use.

See also the mlpack documentation and the documentation for `mlpack_kfn[]` for more details.

Inputs:

- `R`: relation of reference points that the tree should be built on
- `N`: constant indicating the number of arguments in `R` that correspond to keys (i.e. dimensions that should not be considered when building the model).
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

`leaf_size`

(`Int`

): Leaf size for tree building (used for kd-trees, vp trees, random projection trees, UB trees, R trees, R* trees, X trees, Hilbert R trees, R+ trees, R++ trees, and octrees). Default`20`

.`random_basis`

(`Bool`

): Before tree-building, project the data onto a random orthogonal basis. Default`false`

.`seed`

(`Int`

): Random seed (if`0`

,`std::time(NULL)`

is used). Default`0`

.`tree_type`

(`String`

): Type of tree to use:`"kd"`

,`"vp"`

,`"rp"`

,`"max-rp"`

,`"ub"`

,`"cover"`

,`"r"`

,`"r-star"`

, “x”, “ball”, “hilbert-r”, “r-plus”`,`

“r-plus-plus”`,`

“oct”`. Default`

“kd”`.`verbose`

(`Bool`

): Display informational messages.

Result:

- A KFN model that can be used with a later call to
`mlpack_kfn[]`

.

#### Definition

`@inline def mlpack_kfn_build[R, N, H] = ext_ml_build[:mlpack_kfn, R, N, H]`
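Example (an illustrative sketch; the `points` relation and the hyperparameter choices below are made-up values, not library defaults):

```
def points = {(1, 0.0, 0.0); (2, 1.0, 1.0); (3, 5.0, 5.0); (4, 6.0, 5.5)}
def kfn_model = mlpack_kfn_build[points, 1, {("tree_type", "kd")}]
def furthest = mlpack_kfn[2, kfn_model, points, 1, {("algorithm", "dual_tree")}]
```

Here `N = 1` because the first argument of `points` is a key; `furthest` then maps each key to its 2 furthest neighbors and their distances.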

## mlpack_kmeans

```
mlpack_kmeans[K, F, N, H]
```

An implementation of several strategies for efficient k-means clustering. Given a dataset and a value of k, this computes a k-means clustering on that data.

See also the mlpack documentation for more details.

Inputs:

- `K`: constant indicating the number of clusters for k-means clustering.
- `F`: relation of data points to cluster.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `algorithm` (`String`): Algorithm to use for the Lloyd iteration (`"naive"`, `"pelleg-moore"`, `"elkan"`, `"hamerly"`, `"dualtree"`, or `"dualtree-covertree"`). Default `"naive"`.
- `allow_empty_clusters` (`Bool`): Allow empty clusters to persist. Default `false`.
- `kill_empty_clusters` (`Bool`): Remove empty clusters when they occur. Default `false`.
- `max_iterations` (`Int`): Maximum number of iterations before k-means terminates. Default `1000`.
- `percentage` (`Float64`): Percentage of dataset to use for each refined start sampling (use when `refined_start` is specified). Default `0.02`.
- `refined_start` (`Bool`): Use the refined initial point strategy by Bradley and Fayyad to choose initial points. Default `false`.
- `samplings` (`Int`): Number of samplings to perform for refined start (use when `refined_start` is specified). Default `100`.
- `seed` (`Int`): Random seed. If `0`, `std::time(NULL)` is used. Default `0`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation containing the keys in `F` with a cluster assignment (`Int`) between `1` and `K` as the last argument.

#### Definition

`@inline def mlpack_kmeans[K, F, N, H] = ext_ml_transform[:mlpack_kmeans, K, {()}, F, N, H]`
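Example (an illustrative sketch; the data and hyperparameter values are made up):

```
def points = {(1, 1.0, 1.2); (2, 0.9, 1.1); (3, 8.0, 8.2); (4, 8.1, 7.9)}
def hyperparams = {("max_iterations", "100"); ("seed", "42")}
def assignments = mlpack_kmeans[2, points, 1, hyperparams]
```

With `K = 2` and `N = 1`, `assignments` pairs each key of `points` with a cluster index between `1` and `2`.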

## mlpack_kmeans_centroids

```
mlpack_kmeans_centroids[K, F, N, H]
```

An implementation of several strategies for efficient k-means clustering. Given a dataset and a value of k, this computes centroids for a k-means clustering on that data.

See also the mlpack documentation for more details.

Inputs:

- `K`: constant indicating the number of clusters for k-means clustering.
- `F`: relation of data points to cluster.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `algorithm` (`String`): Algorithm to use for the Lloyd iteration (`"naive"`, `"pelleg-moore"`, `"elkan"`, `"hamerly"`, `"dualtree"`, or `"dualtree-covertree"`). Default `"naive"`.
- `allow_empty_clusters` (`Bool`): Allow empty clusters to persist. Default `false`.
- `kill_empty_clusters` (`Bool`): Remove empty clusters when they occur. Default `false`.
- `max_iterations` (`Int`): Maximum number of iterations before k-means terminates. Default `1000`.
- `percentage` (`Float64`): Percentage of dataset to use for each refined start sampling (use when `refined_start` is specified). Default `0.02`.
- `refined_start` (`Bool`): Use the refined initial point strategy by Bradley and Fayyad to choose initial points. Default `false`.
- `samplings` (`Int`): Number of samplings to perform for refined start (use when `refined_start` is specified). Default `100`.
- `seed` (`Int`): Random seed. If `0`, `std::time(NULL)` is used. Default `0`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation containing a cluster index between `1` and `K` that maps to the centroid of each dimension in `F`. So, the first argument of this relation is the cluster index, and the rest correspond to the arguments of `F` that come after the first `N` key arguments.

#### Definition

`@inline def mlpack_kmeans_centroids[K, F, N, H] = ext_ml_transform[:mlpack_kmeans_centroids, K, {()}, F, N, H]`

## mlpack_knn

```
mlpack_knn[K, M, Q, N, H]
```

Perform k-nearest-neighbor search on a relation `Q` containing query points, using a
model `M` that was built with `mlpack_knn_build[]`.

See also the mlpack documentation and the documentation for `mlpack_knn_build[]` for more details.

Inputs:

- `K`: constant representing the number of nearest neighbors to search for.
- `M`: pre-trained model for kNN; must be the result of a previous `mlpack_knn_build[]` call.
- `Q`: relation of query points; must have the same number of keys as the relation that `M` was built with.
- `N`: constant indicating the number of arguments in `Q` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `algorithm` (`String`): Type of neighbor search: `"naive"`, `"single_tree"`, `"dual_tree"`, or `"greedy"`. Default `"dual_tree"`.
- `epsilon` (`Float64`): If specified, will do approximate nearest neighbor search with the given relative error. Default `0`.
- `verbose` (`Bool`): Display informational messages.

Result:

- A relation mapping keys from `Q` to keys in the reference set that the model `M` was built on. The form is `(query_keys..., k, reference_keys..., distance)`, where `k` takes values between `1` and `K` for each possible set of `query_keys...`. Given `query_keys...` and `k`, `reference_keys...` is the set of keys associated with the `k`th nearest neighbor, and `distance` is the Euclidean distance between the point associated with `query_keys...` and the point associated with `reference_keys...`.

#### Definition

`@inline def mlpack_knn[K, M, Q, N, H] = ext_ml_transform[:mlpack_knn, K, M, Q, N, H]`

## mlpack_knn_build

```
mlpack_knn_build[R, N, H]
```

An implementation of k-nearest-neighbor search using single-tree and dual-tree
algorithms. Given a set of reference points and query points, this can build trees that
can be used in later calls to `mlpack_knn[]`.

See also the mlpack documentation and the documentation for `mlpack_knn[]` for more details.

Inputs:

- `R`: relation of reference points that the tree should be built on.
- `N`: constant indicating the number of arguments in `R` that correspond to keys (i.e. dimensions that should not be considered when building the model).
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `leaf_size` (`Int`): Leaf size for tree building (used for kd-trees, vp trees, random projection trees, UB trees, R trees, R* trees, X trees, Hilbert R trees, R+ trees, R++ trees, spill trees, and octrees). Default `20`.
- `random_basis` (`Bool`): Before tree-building, project the data onto a random orthogonal basis. Default `false`.
- `rho` (`Float64`): Balance threshold (only valid for spill trees). Default `0.7`.
- `tau` (`Float64`): Overlapping size (only valid for spill trees). Default `0`.
- `tree_type` (`String`): Type of tree to use: `"kd"`, `"vp"`, `"rp"`, `"max-rp"`, `"ub"`, `"cover"`, `"r"`, `"r-star"`, `"x"`, `"ball"`, `"hilbert-r"`, `"r-plus"`, `"r-plus-plus"`, `"spill"`, or `"oct"`. Default `"kd"`.
- `verbose` (`Bool`): Display informational messages.

Result:

- A KNN model that can be used in a later call to `mlpack_knn[]`.

#### Definition

`@inline def mlpack_knn_build[R, N, H] = ext_ml_build[:mlpack_knn, R, N, H]`
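Example (an illustrative sketch; the data and hyperparameter choices are made up):

```
def points = {(1, 0.0, 0.0); (2, 1.0, 1.0); (3, 5.0, 5.0)}
def knn_model = mlpack_knn_build[points, 1, {("tree_type", "kd")}]
def neighbors = mlpack_knn[2, knn_model, points, 1, {("algorithm", "dual_tree")}]
```

Each point has one key argument (`N = 1`) and two feature dimensions; `neighbors` maps each key to its 2 nearest neighbors and their distances.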

## mlpack_krann

```
mlpack_krann[K, M, Q, N, H]
```

Perform k-rank-approximate-nearest-neighbor search on a relation `Q` containing query
points, using a model `M` that was built with `mlpack_krann_build[]`.

See also the mlpack documentation and the documentation for `mlpack_krann_build[]` for more details.

Inputs:

- `K`: constant representing the number of nearest neighbors to search for.
- `M`: pre-trained model for kRANN; must be the result of a previous `mlpack_krann_build[]` call.
- `Q`: relation of query points; must have the same number of keys as the relation that `M` was built with.
- `N`: constant indicating the number of arguments in `Q` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `alpha` (`Float64`): The desired success probability. Default `0.95`.
- `tau` (`Float64`): The allowed rank-error in terms of the percentile of the data. Default `5`.
- `verbose` (`Bool`): Display informational messages.

Result:

- A relation mapping keys from `Q` to keys in the reference set that the model `M` was built on. The form is `(query_keys..., k, reference_keys..., distance)`, where `k` takes values between `1` and `K` for each possible set of `query_keys...`. Given `query_keys...` and `k`, `reference_keys...` is the set of keys associated with the `k`th rank-approximate nearest neighbor, and `distance` is the Euclidean distance between the point associated with `query_keys...` and the point associated with `reference_keys...`.

#### Definition

`@inline def mlpack_krann[K, M, Q, N, H] = ext_ml_transform[:mlpack_krann, K, M, Q, N, H]`

## mlpack_krann_build

```
mlpack_krann_build[R, N, H]
```

An implementation of rank-approximate k-nearest-neighbor search (kRANN) using single-tree
and dual-tree algorithms. Given a set of reference points and query points, this can
build trees that can be used in later calls to `mlpack_krann[]`.

See also the mlpack documentation and the documentation for `mlpack_krann[]` for more details.

Inputs:

- `R`: relation of reference points that the tree should be built on.
- `N`: constant indicating the number of arguments in `R` that correspond to keys (i.e. dimensions that should not be considered when building the model).
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `first_leaf_exact` (`Bool`): The flag to trigger sampling only after exactly exploring the first leaf. Default `false`.
- `leaf_size` (`Int`): Leaf size for tree building (used for kd-trees, UB trees, R trees, R* trees, X trees, Hilbert R trees, R+ trees, R++ trees, and octrees). Default `20`.
- `naive` (`Bool`): If true, sampling will be done without using a tree. Default `false`.
- `random_basis` (`Bool`): Before tree-building, project the data onto a random orthogonal basis. Default `false`.
- `sample_at_leaves` (`Bool`): The flag to trigger sampling at leaves. Default `false`.
- `seed` (`Int`): Random seed (if `0`, `std::time(NULL)` is used). Default `0`.
- `single_mode` (`Bool`): If true, single-tree search is used (as opposed to dual-tree search). Default `false`.
- `single_sample_limit` (`Int`): The limit on the maximum number of samples (and hence the largest node you can approximate). Default `20`.
- `tree_type` (`String`): Type of tree to use: `"kd"`, `"ub"`, `"cover"`, `"r"`, `"x"`, `"r-star"`, `"hilbert-r"`, `"r-plus"`, `"r-plus-plus"`, or `"oct"`. Default `"kd"`.
- `verbose` (`Bool`): Display informational messages.

Result:

- A rank-approximate KNN model that can be used in a later call to `mlpack_krann[]`.

#### Definition

`@inline def mlpack_krann_build[R, N, H] = ext_ml_build[:mlpack_krann, R, N, H]`
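Example (an illustrative sketch; the data and hyperparameter values are made up):

```
def points = {(1, 0.0, 0.0); (2, 1.0, 1.0); (3, 5.0, 5.0); (4, 6.0, 4.0)}
def krann_model = mlpack_krann_build[points, 1, {("tree_type", "kd")}]
def neighbors = mlpack_krann[1, krann_model, points, 1, {("tau", "10"); ("alpha", "0.9")}]
```

This searches for each point's rank-approximate nearest neighbor, allowing a rank error of 10% of the data with success probability 0.9.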

## mlpack_lars

```
mlpack_lars[F, R, H]
```

An implementation of Least Angle Regression (Stagewise/laSso), also known as LARS. This can train a LARS/LASSO/Elastic Net model.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on.
- `R`: relation of responses; the last variable should be the response; everything else should be keys.
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `lambda1` (`Float64`): Regularization parameter for l1-norm penalty. Default `0`.
- `lambda2` (`Float64`): Regularization parameter for l2-norm penalty. Default `0`.
- `use_cholesky` (`Bool`): Use Cholesky decomposition during computation rather than explicitly computing the full Gram matrix.
- `verbose` (`Bool`): Display informational messages and the full list of parameters and timers at the end of execution.

#### Definition

`@inline def mlpack_lars[F, R, H] = ext_ml_train[:mlpack_lars, F, R, H]`

## mlpack_lars_predict

```
mlpack_lars_predict[M, F, N]
```

Given a LARS model trained with `mlpack_lars[]`, make predictions on a test set.

See also the mlpack documentation and the documentation for `mlpack_lars[]` for more details.

Inputs:

- `M`: LARS model to use for prediction; must be the result of a previous `mlpack_lars[]` call.
- `F`: relation of test features for which predictions will be computed.
- `N`: constant Int representing the number of keys in `F`.

#### Definition

`@inline def mlpack_lars_predict[M, F, N] = ext_ml_predict[:mlpack_lars_predict, M, F, N]`
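Example (an illustrative sketch of training and prediction together; the data and `lambda1` value are made up):

```
def features = {(1, 1.0, 2.0); (2, 2.0, 3.0); (3, 3.0, 5.0); (4, 4.0, 7.0)}
def responses = {(1, 1.5); (2, 2.5); (3, 4.0); (4, 5.5)}
def lars_model = mlpack_lars[features, responses, {("lambda1", "0.1")}]
def predictions = mlpack_lars_predict[lars_model, features, 1]
```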

## mlpack_linear_regression

```
mlpack_linear_regression[F, R, H]
```

An implementation of simple linear regression and ridge regression using ordinary least squares. Given a dataset and responses, a model can be trained.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on.
- `R`: relation of responses; the last variable should be the response; everything else should be keys.
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `lambda` (`Float64`): Tikhonov regularization for ridge regression. If `0`, the method reduces to linear regression. Default `0`.
- `verbose` (`Bool`): Display informational messages and the full list of parameters and timers at the end of execution.

#### Definition

`@inline def mlpack_linear_regression[F, R, H] = ext_ml_train[:mlpack_linear_regression, F, R, H]`

## mlpack_linear_regression_predict

```
mlpack_linear_regression_predict[M, F, N]
```

Given a linear regression model trained with `mlpack_linear_regression[]`, make
predictions on a test set.

See also the mlpack documentation and the documentation for `mlpack_linear_regression[]` for more details.

Inputs:

- `M`: linear regression model to use for prediction; must be the result of a previous `mlpack_linear_regression[]` call.
- `F`: relation of test features for which predictions will be computed.
- `N`: constant Int representing the number of keys in `F`.

#### Definition

`@inline def mlpack_linear_regression_predict[M, F, N] = ext_ml_predict[:mlpack_linear_regression_predict, M, F, N]`
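Example (an illustrative sketch; the data and `lambda` value are made up):

```
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0)}
def responses = {(1, 2.1); (2, 3.9); (3, 6.2); (4, 7.8)}
def lr_model = mlpack_linear_regression[features, responses, {("lambda", "0.0")}]
def predictions = mlpack_linear_regression_predict[lr_model, features, 1]
```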

## mlpack_linear_svm

```
mlpack_linear_svm[F, R, H]
```

An implementation of linear SVM for multiclass classification. Given labeled data, a model can be trained and saved for future use.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on.
- `R`: relation of responses; the last variable should be the response; everything else should be keys.
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `delta` (`Float64`): Margin of difference between the correct class and other classes. Default `1.0`.
- `epochs` (`Int`): Maximum number of full epochs over the dataset for psgd. Default `50`.
- `lambda` (`Float64`): L2-regularization parameter for training. Default `0.0001`.
- `max_iterations` (`Int`): Maximum iterations for the optimizer (`0` indicates no limit). Default `10000`.
- `no_intercept` (`Bool`): Do not add the intercept term to the model. Default `false`.
- `num_classes` (`Int`): Number of classes for classification; if unspecified (or `0`), the number of classes found in the labels will be used. Default `0`.
- `optimizer` (`String`): Optimizer to use for training (`"lbfgs"` or `"psgd"`). Default `"lbfgs"`.
- `seed` (`Int`): Random seed. If `0`, `std::time(NULL)` is used. Default `0`.
- `shuffle` (`Bool`): If true, don't shuffle the order in which data points are visited for parallel SGD. Default `false`.
- `step_size` (`Float64`): Step size for the parallel SGD optimizer. Default `0.01`.
- `tolerance` (`Float64`): Convergence tolerance for the optimizer. Default `1e-10`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

#### Definition

`@inline def mlpack_linear_svm[F, R, H] = ext_ml_train[:mlpack_linear_svm, F, R, H]`

## mlpack_linear_svm_predict

```
mlpack_linear_svm_predict[M, F, N]
```

Given a linear SVM model trained with `mlpack_linear_svm[]`, make predictions on a test
set.

See also the mlpack documentation and the documentation for `mlpack_linear_svm[]` for more details.

Inputs:

- `M`: linear SVM model to use for prediction; must be the result of a previous `mlpack_linear_svm[]` call.
- `F`: relation of test features for which predictions will be computed.
- `N`: constant Int representing the number of keys in `F`.

#### Definition

`@inline def mlpack_linear_svm_predict[M, F, N] = ext_ml_predict[:mlpack_linear_svm_predict, M, F, N]`
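Example (an illustrative sketch; the data, labels, and `lambda` value are made up):

```
def features = {(1, 0.2, 0.1); (2, 0.4, 0.3); (3, 3.1, 2.8); (4, 3.3, 3.0)}
def labels = {(1, 1); (2, 1); (3, 2); (4, 2)}
def svm_model = mlpack_linear_svm[features, labels, {("lambda", "0.001")}]
def predictions = mlpack_linear_svm_predict[svm_model, features, 1]
```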

## mlpack_logistic_regression

```
mlpack_logistic_regression[F, R, H]
```

An implementation of L2-regularized logistic regression for two-class classification. Given labeled data, a model can be trained and saved for future use.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on.
- `R`: relation of responses; the last variable should be the response; everything else should be keys.
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `batch_size` (`Int`): Batch size for SGD. Default `64`.
- `decision_boundary` (`Float64`): Decision boundary for prediction; if the logistic function for a point is less than the boundary, the class is taken to be `1`; otherwise, the class is `2`. Default `0.5`.
- `lambda` (`Float64`): L2-regularization parameter for training. Default `0`.
- `max_iterations` (`Int`): Maximum iterations for the optimizer (`0` indicates no limit). Default `10000`.
- `optimizer` (`String`): Optimizer to use for training (`"lbfgs"` or `"sgd"`). Default `"lbfgs"`.
- `step_size` (`Float64`): Step size for the SGD optimizer. Default `0.01`.
- `tolerance` (`Float64`): Convergence tolerance for the optimizer. Default `1e-10`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

#### Definition

`@inline def mlpack_logistic_regression[F, R, H] = ext_ml_train[:mlpack_logistic_regression, F, R, H]`

## mlpack_logistic_regression_predict

```
mlpack_logistic_regression_predict[M, F, N]
```

Given a logistic regression model trained with `mlpack_logistic_regression[]`, make
class predictions on a test set.

See also the mlpack documentation and the documentation for `mlpack_logistic_regression[]` for more details.

Inputs:

- `M`: logistic regression model to use for class predictions; must be the result of a previous `mlpack_logistic_regression[]` call.
- `F`: relation of test features for which class predictions will be computed.
- `N`: constant Int representing the number of keys in `F`.

#### Definition

`@inline def mlpack_logistic_regression_predict[M, F, N] = ext_ml_predict[:mlpack_logistic_regression_predict, M, F, N]`
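Example (an illustrative sketch; the data, labels, and `lambda` value are made up):

```
def features = {(1, 0.1); (2, 0.3); (3, 2.6); (4, 2.9)}
def labels = {(1, 0); (2, 0); (3, 1); (4, 1)}
def logit_model = mlpack_logistic_regression[features, labels, {("lambda", "0.01")}]
def predictions = mlpack_logistic_regression_predict[logit_model, features, 1]
```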

## mlpack_lsh

```
mlpack_lsh[K, M, Q, N, H]
```

Perform approximate k-nearest-neighbor search on a relation `Q` containing query points,
using a model `M` that was built with `mlpack_lsh_build[]`.

See also the mlpack documentation and the documentation for `mlpack_lsh_build[]` for more details.

Inputs:

- `K`: constant representing the number of nearest neighbors to search for.
- `M`: pre-trained LSH model; must be the result of a previous `mlpack_lsh_build[]` call.
- `Q`: relation of query points; must have the same number of keys as the relation that `M` was built with.
- `N`: constant indicating the number of arguments in `Q` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `num_probes` (`Int`): Number of additional probes for multiprobe LSH; if `0`, traditional LSH is used. Default `0`.
- `verbose` (`Bool`): Display informational messages.

Result:

- A relation mapping keys from `Q` to keys in the reference set that the model `M` was built on. The form is `(query_keys..., k, reference_keys..., distance)`, where `k` takes values between `1` and `K` for each possible set of `query_keys...`. Given `query_keys...` and `k`, `reference_keys...` is the set of keys associated with the `k`th nearest neighbor, and `distance` is the Euclidean distance between the point associated with `query_keys...` and the point associated with `reference_keys...`.

#### Definition

`@inline def mlpack_lsh[K, M, Q, N, H] = ext_ml_transform[:mlpack_lsh, K, M, Q, N, H]`

## mlpack_lsh_build

```
mlpack_lsh_build[R, N, H]
```

An implementation of approximate k-nearest-neighbor search with locality-sensitive hashing (LSH). Given a set of reference points, this will build an LSH model.

See also the mlpack documentation and the documentation for `mlpack_lsh[]` for more details.

Inputs:

- `R`: relation of reference points that the model should be built on.
- `N`: constant indicating the number of arguments in `R` that correspond to keys (i.e. dimensions that should not be considered when building the model).
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `bucket_size` (`Int`): The size of a bucket in the second level hash. Default `500`.
- `hash_width` (`Float64`): The hash width for the first-level hashing in the LSH preprocessing. By default, the LSH class automatically estimates a hash width for its use.
- `projections` (`Int`): The number of hash functions for each table. Default `10`.
- `second_hash_size` (`Int`): The size of the second level hash table. Default `99901`.
- `seed` (`Int`): Random seed. If `0`, `std::time(NULL)` is used. Default `0`.
- `tables` (`Int`): The number of hash tables to be used. Default `30`.
- `verbose` (`Bool`): Display informational messages.

Result:

- An LSH model that can be used in a later call to `mlpack_lsh[]`.

#### Definition

`@inline def mlpack_lsh_build[R, N, H] = ext_ml_build[:mlpack_lsh, R, N, H]`
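Example (an illustrative sketch; the data and hyperparameter values are made up):

```
def points = {(1, 0.0, 0.0); (2, 1.0, 1.0); (3, 5.0, 5.0)}
def lsh_model = mlpack_lsh_build[points, 1, {("tables", "30"); ("seed", "42")}]
def neighbors = mlpack_lsh[2, lsh_model, points, 1, {("num_probes", "0")}]
```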

## mlpack_mean_shift

```
mlpack_mean_shift[F, N, H]
```

A clustering of the data using the mean shift algorithm. Uses a fast implementation of mean-shift clustering using dual-tree range search.

See the mlpack documentation for more details.

Inputs:

- `F`: relation of data points to cluster.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `force_convergence` (`Bool`): If specified, the mean shift algorithm will continue running regardless of `max_iterations` until the clusters converge. Default `false`.
- `max_iterations` (`Int`): Maximum number of iterations before mean shift terminates. Default `1000`.
- `radius` (`Float64`): If the distance between two centroids is less than the given radius, one will be removed. A radius of `0` or less means an estimate will be calculated and used for the radius. Default `0`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation containing the keys in `F` with a cluster assignment (`Int`) as the last element. If the key was not assigned to a cluster, the cluster assignment will be `0`.

#### Definition

`@inline def mlpack_mean_shift[F, N, H] = ext_ml_transform[:mlpack_mean_shift, 0, {()}, F, N, H]`
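Example (an illustrative sketch; the data and `max_iterations` value are made up):

```
def points = {(1, 1.0, 1.1); (2, 0.9, 1.0); (3, 7.9, 8.0); (4, 8.1, 8.2)}
def clusters = mlpack_mean_shift[points, 1, {("max_iterations", "500")}]
```

Unlike `mlpack_kmeans[]`, no cluster count is given; mean shift discovers the number of clusters itself.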

## mlpack_nbc

```
mlpack_nbc[F, R, H]
```

An implementation of the Naive Bayes Classifier, used for classification. Given labeled data, an NBC model can be trained.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on.
- `R`: relation of responses; the last variable should be the response; everything else should be keys.
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `incremental_variance` (`Bool`): The variance of each class will be calculated incrementally.
- `verbose` (`Bool`): Display informational messages and the full list of parameters and timers at the end of execution.

#### Definition

`@inline def mlpack_nbc[F, R, H] = ext_ml_train[:mlpack_nbc, F, R, H]`

## mlpack_nbc_predict

```
mlpack_nbc_predict[M, F, N]
```

Given a Naive Bayes classifier model trained with `mlpack_nbc[]`, make class predictions
on a test set.

See also the mlpack documentation and the documentation for `mlpack_nbc[]` for more details.

Inputs:

- `M`: Naive Bayes classification model to use for prediction; must be the result of a previous `mlpack_nbc[]` call.
- `F`: relation of test features for which class predictions will be computed.
- `N`: constant Int representing the number of keys in `F`.

#### Definition

`@inline def mlpack_nbc_predict[M, F, N] = ext_ml_predict[:mlpack_nbc_predict, M, F, N]`
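Example (an illustrative sketch; the data and labels are made up):

```
def features = {(1, 0.2, 1.1); (2, 0.1, 0.9); (3, 2.2, 3.1); (4, 2.4, 3.0)}
def labels = {(1, 1); (2, 1); (3, 2); (4, 2)}
def nbc_model = mlpack_nbc[features, labels, {("incremental_variance", "true")}]
def predictions = mlpack_nbc_predict[nbc_model, features, 1]
```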

## mlpack_nmf

```
mlpack_nmf[R, F, N, H]
```

An implementation of non-negative matrix factorization. This can be used to decompose an input dataset into two low-rank non-negative components.

See also the mlpack documentation for more details.

Inputs:

- `R`: constant indicating the rank of the low-rank decomposition.
- `F`: relation of features to decompose into two low-rank matrices.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `max_iterations` (`Int`): Number of iterations before NMF terminates (`0` runs until convergence). Default `10000`.
- `min_residue` (`Float64`): The minimum root mean square residue allowed for each iteration, below which the program terminates. Default `1e-05`.
- `seed` (`Int`): Random seed. If `0`, `std::time(NULL)` is used. Default `0`.
- `update_rules` (`String`): Update rules for each iteration (`"multdist"`, `"multdiv"`, or `"als"`). Default `"multdist"`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation encoding *both* of the low-rank matrices `W` and `H`. The encoding is slightly confusing because rows in `W` are keyed by the first `N` arguments of `F`, but rows in `H` are keyed by integers. The first argument of the output is `1` if the tuple corresponds to a row of `W`, or `2` if the tuple corresponds to a row of `H`. If the first argument is `1`, the next `N` arguments are keys from `F`; otherwise they are zero values that should be ignored. After that, if the first argument is `2`, the next argument is the (`Int`) row index for tuples pertaining to `H`; otherwise it is a zero value that should be ignored. The following argument is the (`Int`) index of the argument that the tuple pertains to in `W` or `H`. The last argument is the (`Float64`) value in either `W` or `H` referenced by the previous arguments.

In some sense, the format of the result can be understood as an "interleaved sparse representation" of `W` and `H`. We are forced to do this in part because Rel cannot currently return two relations easily from one call.

#### Definition

`@inline def mlpack_nmf[R, F, N, H] = ext_ml_transform[:mlpack_nmf, R, {()}, F, N, H]`

## mlpack_pca

```
mlpack_pca[D, F, N, H]
```

An implementation of several strategies for principal components analysis (PCA), a common preprocessing step. Given a dataset and a desired new dimensionality, this can reduce the dimensionality of the data using the linear transformation determined by PCA.

See also the mlpack documentation for more details.

Input options:

- `D`: constant indicating the desired new dimensionality of the data.
- `F`: relation of features to perform PCA on.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `decomposition_method` (`String`): Method used for the principal components analysis: `"exact"`, `"randomized"`, `"randomized-block-krylov"`, or `"quic"`. Default `"exact"`.
- `scale` (`Bool`): If set, the data will be scaled before running PCA, such that the variance of each feature is `1`. Default `false`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation mapping keys in `F` (i.e. the first `N` arguments of `F`) to `D` values in each dimension.

#### Definition

`@inline def mlpack_pca[D, F, N, H] = ext_ml_transform[:mlpack_pca, D, {()}, F, N, H]`
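Example (an illustrative sketch; the data and hyperparameter values are made up):

```
def points = {(1, 1.0, 2.0, 3.0); (2, 2.0, 4.1, 6.2); (3, 3.0, 6.0, 8.9)}
def projected = mlpack_pca[2, points, 1, {("scale", "true")}]
```

This reduces the three feature dimensions of `points` to `D = 2` dimensions, keeping the single key argument intact.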

## mlpack_perceptron

```
mlpack_perceptron[F, R, H]
```

An implementation of a perceptron—a single level neural network—for classification. Given labeled data, a perceptron can be trained.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on.
- `R`: relation of responses; the last variable should be the response; everything else should be keys.
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `max_iterations` (`Int`): The maximum number of iterations the perceptron is to be run. Default `1000`.
- `verbose` (`Bool`): Display informational messages and the full list of parameters and timers at the end of execution.

#### Definition

`@inline def mlpack_perceptron[F, R, H] = ext_ml_train[:mlpack_perceptron, F, R, H]`

## mlpack_perceptron_predict

```
mlpack_perceptron_predict[M, F, N]
```

Given a perceptron model trained with `mlpack_perceptron[]`, make class predictions on a
test set.

See also the mlpack documentation and the documentation for `mlpack_perceptron[]` for more details.

Inputs:

- `M`: Perceptron model to use for prediction; must be the result of a previous `mlpack_perceptron[]` call.
- `F`: relation of test features for which class predictions will be computed.
- `N`: constant Int representing the number of keys in `F`.

#### Definition

`@inline def mlpack_perceptron_predict[M, F, N] = ext_ml_predict[:mlpack_perceptron_predict, M, F, N]`

## mlpack_preprocess_split

```
mlpack_preprocess_split[F, H]
```

This utility takes a dataset and splits it into a training set and a test set. Before the split, the points in the dataset are randomly reordered. The percentage of the dataset to be used as the test set can be specified with the `test_ratio` parameter; the default is `0.2` (20%).

See also the mlpack documentation for more details.

Input options:

- `F`: relation of features to split. If you want to split labels too, they should be included in this relation.
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `no_shuffle` (`Bool`): Avoid shuffling the data before splitting. Default `false`.
- `seed` (`Int`): Random seed (`0` for `std::time(NULL)`). Default `0`.
- `test_ratio` (`Float64`): Ratio of test set; if not set, the ratio defaults to `0.2`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation `F` with membership in the training or test set prepended. So, if `(t...)` was a tuple in `F`, `(set, t...)` will be included, where `set` is `1` if the point `t...` is part of the training set and `2` if it is part of the test set.
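
Example (illustrative data; the key is the first argument of each tuple):

```
def features = {(1, 1.0, 2.0); (2, 2.0, 3.0); (3, 3.0, 4.0); (4, 4.0, 5.0); (5, 5.0, 6.0)}
def hyperparams = {("test_ratio", "0.2"); ("seed", "20")}
def split = mlpack_preprocess_split[features, hyperparams]

// Training points are prefixed with 1, test points with 2.
def train_set(xs...) = split(1, xs...)
def test_set(xs...) = split(2, xs...)
```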

#### Definition

`@inline def mlpack_preprocess_split[F, H] = ext_ml_transform[:mlpack_preprocess_split, 0, {()}, F, 0, H]`

## mlpack_radical

```
mlpack_radical[F, N, H]
```

An implementation of RADICAL, a method for independent component analysis (ICA). Given a dataset, this can decompose the dataset into an independent component matrix; this can be useful for preprocessing.

See also the mlpack documentation for more details.

Input options:

- `F`: relation of features to perform RADICAL on.
- `N`: constant indicating the number of arguments in `F` that correspond to keys (i.e., dimensions that should not be considered when transforming the data).
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `angles` (`Int`): Number of angles to consider in brute-force search during Radical2D. Default `150`.
- `noise_std_dev` (`Float64`): Standard deviation of Gaussian noise. Default `0.175`.
- `objective` (`Bool`): If set, an estimate of the final objective function is printed. Default `false`.
- `replicates` (`Int`): Number of Gaussian-perturbed replicates to use (per point) in Radical2D. Default `30`.
- `sweeps` (`Int`): Number of sweeps; each sweep calls Radical2D once for each pair of dimensions. Default `0`.
- `verbose` (`Bool`): Display informational messages. Default `false`.

Result:

- A relation mapping keys in `F` (i.e., the first `N` arguments of `F`) to independent component values in each dimension.
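
Example (a sketch with invented two-dimensional data and one key argument, so `N = 1`):

```
def features = {(1, 0.1, 0.9); (2, 0.2, 0.8); (3, 0.4, 0.6); (4, 0.7, 0.3)}
def hyperparams = {("angles", "100")}
def components = mlpack_radical[features, 1, hyperparams]
```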

#### Definition

`@inline def mlpack_radical[F, N, H] = ext_ml_transform[:mlpack_radical, 0, {()}, F, N, H]`

## mlpack_random_forest

```
mlpack_random_forest[F, R, H]
```

An implementation of the standard random forest algorithm by Leo Breiman for classification. Given labeled data, a random forest can be trained.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on
- `R`: relation of responses; the last variable should be the response; everything else should be keys
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `maximum_depth` (`Int`): Maximum depth of the tree (`0` means no limit). Default `0`.
- `minimum_gain_split` (`Float64`): Minimum gain needed to make a split when building a tree. Default `0.0`.
- `minimum_leaf_size` (`Int`): Minimum number of points in each leaf node. Default `1`.
- `num_trees` (`Int`): Number of trees in the random forest. Default `10`.
- `print_training_accuracy` (`Bool`): If set, the accuracy of the model on the training set will be printed (`verbose` must also be specified).
- `seed` (`Int`): Random seed. If `0`, `std::time(NULL)` is used. Default `0`.
- `subspace_dim` (`Int`): Dimensionality of random subspace to use for each split. `0` autoselects the square root of the data dimensionality. Default `0`.
- `verbose` (`Bool`): Display informational messages and the full list of parameters and timers at the end of execution. Default `false`.
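
Example (a minimal sketch; the data values are invented):

```
def features = {(1, 1.0, 2.0); (2, 2.0, 1.0); (3, 8.0, 9.0); (4, 9.0, 8.0)}
def responses = {(1, 0); (2, 0); (3, 1); (4, 1)}
def hyperparams = {("num_trees", "20"); ("seed", "1")}
def model = mlpack_random_forest[features, responses, hyperparams]
```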

#### Definition

`@inline def mlpack_random_forest[F, R, H] = ext_ml_train[:mlpack_random_forest, F, R, H]`

## mlpack_random_forest_predict

```
mlpack_random_forest_predict[M, F, N]
```

Given a random forest model trained with `mlpack_random_forest[]`, make class predictions on a test set.

See also the mlpack documentation and the documentation for `mlpack_random_forest[]` for more details.

Inputs:

- `M`: random forest model to use for prediction; must be the result of a previous `mlpack_random_forest[]` call
- `F`: relation of test features for which class predictions will be computed
- `N`: constant Int representing the number of keys in `F`
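
Example (a sketch, assuming `model` was produced by an earlier `mlpack_random_forest[]` call and the test relation has one key argument, so `N = 1`; the test points are invented):

```
def test_features = {(1, 1.5, 1.5); (2, 8.5, 8.5)}
def predictions = mlpack_random_forest_predict[model, test_features, 1]
```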

#### Definition

`@inline def mlpack_random_forest_predict[M, F, N] = ext_ml_predict[:mlpack_random_forest_predict, M, F, N]`

## mlpack_softmax_regression

```
mlpack_softmax_regression[F, R, H]
```

An implementation of softmax regression for classification, which is a multiclass generalization of logistic regression. Given labeled data, a softmax regression model can be trained and saved for future use.

See also the mlpack documentation for more details.

Inputs:

- `F`: relation of features to learn on
- `R`: relation of responses; the last variable should be the response; everything else should be keys
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `lambda` (`Float64`): L2-regularization constant. Default `0.0001`.
- `max_iterations` (`Int`): Maximum number of iterations before termination. Default `400`.
- `no_intercept` (`Bool`): Do not add the intercept term to the model.
- `number_of_classes` (`Int`): Number of classes for classification; if unspecified (or `0`), the number of classes found in the labels will be used. Default `0`.
- `verbose` (`Bool`): Display informational messages and the full list of parameters and timers at the end of execution. Default `false`.
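
Example (a minimal sketch with three invented classes):

```
def features = {(1, 1.0); (2, 1.1); (3, 5.0); (4, 5.1); (5, 9.0); (6, 9.1)}
def responses = {(1, 0); (2, 0); (3, 1); (4, 1); (5, 2); (6, 2)}
def hyperparams = {("lambda", "0.001"); ("max_iterations", "200")}
def model = mlpack_softmax_regression[features, responses, hyperparams]
```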

#### Definition

`@inline def mlpack_softmax_regression[F, R, H] = ext_ml_train[:mlpack_softmax_regression, F, R, H]`

## mlpack_softmax_regression_predict

```
mlpack_softmax_regression_predict[M, F, N]
```

Given a softmax regression model trained with `mlpack_softmax_regression[]`, make class predictions on a test set.

See also the mlpack documentation and the documentation for `mlpack_softmax_regression[]` for more details.

Inputs:

- `M`: softmax regression model to use for prediction; must be the result of a previous `mlpack_softmax_regression[]` call
- `F`: relation of test features for which class predictions will be computed
- `N`: constant Int representing the number of keys in `F`

#### Definition

`@inline def mlpack_softmax_regression_predict[M, F, N] = ext_ml_predict[:mlpack_softmax_regression_predict, M, F, N]`

## xgboost_classifier

```
xgboost_classifier[F, L, H]
```

A binding of the `xgboost()` function to train an XGBoost model (via XGBoost.jl). This fits a boosted tree model with the XGBoost algorithm to the features `F` and labels `L`, using hyperparameters specified in the relation `H`.

If you would like to train a regression model with XGBoost, see `xgboost_regressor[]`.

See also the XGBoost documentation for each hyperparameter.

Note that there are *very many* hyperparameters, all of them optional. Only common parameters are documented here; the link above documents each of these parameters in more depth, as well as many less common hyperparameters not listed here.

Inputs:

- `F`: relation of features to learn on
- `L`: relation of labels; the last variable should be the label; everything else should be keys
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters (incomplete list):

- `num_round` (`Int`): Number of rounds of boosting to perform. (Default `50`.)
- `booster` (`String`): Which booster to use. Can be `"gbtree"`, `"gblinear"`, or `"dart"`; `"gbtree"` and `"dart"` use tree-based models while `"gblinear"` uses linear functions. (Default `"gbtree"`.)
- `verbosity` (`Int`): Verbosity of printing messages. Valid values are `0` (silent), `1` (warning), `2` (info), and `3` (debug). (Default `1`.)
- `objective` (`String`): Specify the learning task and the corresponding learning objective. Valid options include `"binary:logistic"`, `"binary:hinge"`, `"multi:softmax"`, and other classification objectives listed in the XGBoost documentation. (Default `"multi:softmax"`.)
- `base_score` (`Float64`): The initial prediction score of all instances. (Default `0.5`.)
- `eval_metric` (`String`): Evaluation metrics for validation data. Valid choices include `"merror"`, `"error"`, `"logloss"`, `"auc"`, `"aucpr"`, `"ndcg"`, `"map"`, and other classification evaluation metrics specified in the XGBoost documentation. (Default set based on the `objective` value.)
- `seed` (`Int`): Random number seed. (Default `0`.)
- `eta` (`Float64`): Step size shrinkage used in update to prevent overfitting. (Default `0.3`.)
- `gamma` (`Float64`): Minimum loss reduction required to make a further partition on a leaf node of the tree. (Default `0.0`.)
- `max_depth` (`Int`): Maximum depth of a tree. (Default `6`.)
- `min_child_weight` (`Float64`): Minimum sum of instance weight (hessian) needed in a child. (Default `1.0`.)
- `max_delta_step` (`Float64`): Maximum delta step we allow each leaf output to be. If set to `0`, there is no constraint. (Default `0.0`.)
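
Example (a sketch; the data and hyperparameter choices are illustrative):

```
def features = {(1, 1.0, 2.0); (2, 2.0, 1.0); (3, 8.0, 9.0); (4, 9.0, 8.0)}
def labels = {(1, 0); (2, 0); (3, 1); (4, 1)}
def hyperparams = {("num_round", "20"); ("max_depth", "3"); ("objective", "binary:logistic")}
def model = xgboost_classifier[features, labels, hyperparams]
```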

#### Definition

`@inline def xgboost_classifier[F, L, H] = ext_ml_train[:xgboost_classifier, F, L, H]`

## xgboost_classifier_predict

```
xgboost_classifier_predict[M, F, N]
```

Given an XGBoost classification model trained with `xgboost_classifier[]`, make class predictions on a test set.

For more information, see the documentation for `xgboost_classifier[]` and the XGBoost documentation.

Inputs:

- `M`: XGBoost classification model to use for prediction; must be the result of a previous `xgboost_classifier[]` call
- `F`: relation of test features for which class predictions will be computed
- `N`: constant Int representing the number of keys in `F`

#### Definition

`@inline def xgboost_classifier_predict[M, F, N] = ext_ml_predict[:xgboost_classifier_predict, M, F, N]`

## xgboost_classifier_probabilities

```
xgboost_classifier_probabilities[M, F, N]
```

Given an XGBoost classification model trained with `xgboost_classifier[]`, compute the probabilities of each class for each point in `F`.

Note that `M` must be an XGBoost classification model trained with the `binary:logistic` or `multi:softprob` objectives.

For more information, see the documentation for `xgboost_classifier[]` and the XGBoost documentation.

Inputs:

- `M`: XGBoost classification model to use for prediction; must be the result of a previous `xgboost_classifier[]` call
- `F`: relation of test features for which class predictions will be computed
- `N`: constant Int representing the number of keys in `F`

Result:

- A relation `probabilities(keys..., class, prob)` where `keys...` are the keys of each point in `F`, `class` takes values for every class in `M`, and `prob` is the probability of that class for those keys.
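
Example (a sketch, assuming `model` was trained by `xgboost_classifier[]` with the `binary:logistic` objective and the test relation has one key argument, so `N = 1`):

```
def test_features = {(1, 1.5, 1.5); (2, 8.5, 8.5)}
def probs = xgboost_classifier_probabilities[model, test_features, 1]
// probs(key, class, prob): one probability per class for each test point.
```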

#### Definition

`@inline def xgboost_classifier_probabilities[M, F, N] = ext_ml_transform[:xgboost_classifier_probabilities, 0, M, F, N, {}]`

## xgboost_feature_importances

```
xgboost_feature_importances[M, F, H]
```

Given an XGBoost model trained with `xgboost_classifier[]` or `xgboost_regressor[]` and the feature module `F` that it was trained with (or an equivalent feature module with the same feature names), return an arity-2 relation mapping feature names (as `String`s) to feature importance values.

Note that this relation may be empty if feature importance cannot be computed! (This could happen, for instance, if the model’s trees don’t have any splits at all.)

For more information, see the documentation for `xgboost_classifier[]`, `xgboost_regressor[]`, and the `importances()` function from XGBoost.jl.

Inputs:

- `M`: XGBoost classification or regression model; must be the result of a previous `xgboost_classifier[]` or `xgboost_regressor[]` call.
- `F`: relation containing all of the same features that the model was trained on
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters:

- `type`: type of feature importance to return; valid options are `"gain"`, `"cover"`, and `"freq"`. Default `"gain"`.
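
Example (a sketch, assuming `model` and `features` come from an earlier `xgboost_classifier[]` or `xgboost_regressor[]` training call):

```
def hyperparams = {("type", "cover")}
def importances = xgboost_feature_importances[model, features, hyperparams]
// importances maps each feature name (String) to its importance value.
```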

#### Definition

```
@inline def xgboost_feature_importances[M, F, H](f_str, imp) = exists(i :
    // We need to get a sorted list of specializations in F, but to do this we must
    // convert them to strings (we cannot sort symbols at the moment).
    sort[f : exists(f_sym, xs... : F(f_sym, xs...) and f = string[f_sym])](i, f_str) and
    // :xgboost_feature_importances produces mappings (i => importance) for integer i,
    // and these will match the sorted index of the feature names. It's possible we might
    // not get a feature back! That means we need to insert a default value.
    (
        ext_ml_transform[:xgboost_feature_importances, 0, M, {()}, 0, H](i, imp) or
        (not exists(v : ext_ml_transform[
            :xgboost_feature_importances, 0, M, {()}, 0, H
        ](i, v)) and imp = 0.0)
    )
)
```

## xgboost_regressor

```
xgboost_regressor[F, R, H]
```

A binding of the `xgboost()` function to train an XGBoost regression model (via XGBoost.jl). This fits a boosted tree model with the XGBoost algorithm to the features `F` and responses `R`, using hyperparameters specified in the relation `H`.

If you would like to train a classification model with XGBoost, see `xgboost_classifier[]`.

See also the XGBoost documentation for each hyperparameter.

Note that there are *very many* hyperparameters, all of them optional. Only common parameters are documented here; the link above documents each of these parameters in more depth, as well as many less common hyperparameters not listed here.

Inputs:

- `F`: relation of features to learn on
- `R`: relation of responses; the last variable should be the response; everything else should be keys
- `H`: relation of hyperparameters, encoded as `(String, String)`; e.g., `{("param1", "10"); ("param2", "true")}`

Hyperparameters (incomplete list):

- `num_round` (`Int`): Number of rounds of boosting to perform. (Default `50`.)
- `booster` (`String`): Which booster to use. Can be `"gbtree"`, `"gblinear"`, or `"dart"`; `"gbtree"` and `"dart"` use tree-based models while `"gblinear"` uses linear functions. (Default `"gbtree"`.)
- `verbosity` (`Int`): Verbosity of printing messages. Valid values are `0` (silent), `1` (warning), `2` (info), and `3` (debug). (Default `1`.)
- `objective` (`String`): Specify the learning task and the corresponding learning objective. Valid options include `"reg:squarederror"`, `"reg:squaredlogerror"`, `"reg:logistic"`, `"reg:pseudohubererror"`, `"reg:gamma"`, `"reg:tweedie"`, and other regression objectives listed in the XGBoost documentation. (Default `"reg:squarederror"`.)
- `base_score` (`Float64`): The initial prediction score of all instances. (Default `0.5`.)
- `eval_metric` (`String`): Evaluation metrics for validation data. Valid choices include `"rmse"`, `"rmsle"`, `"mae"`, `"mape"`, `"mphe"`, and other regression evaluation metrics specified in the XGBoost documentation. (Default set based on the `objective` value.)
- `seed` (`Int`): Random number seed. (Default `0`.)
- `eta` (`Float64`): Step size shrinkage used in update to prevent overfitting. (Default `0.3`.)
- `gamma` (`Float64`): Minimum loss reduction required to make a further partition on a leaf node of the tree. (Default `0.0`.)
- `max_depth` (`Int`): Maximum depth of a tree. (Default `6`.)
- `min_child_weight` (`Float64`): Minimum sum of instance weight (hessian) needed in a child. (Default `1.0`.)
- `max_delta_step` (`Float64`): Maximum delta step we allow each leaf output to be. If set to `0`, there is no constraint. (Default `0.0`.)
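
Example (a minimal sketch; the data is an invented noisy linear relationship):

```
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 2.1); (2, 3.9); (3, 6.2); (4, 8.0); (5, 9.8)}
def hyperparams = {("num_round", "30"); ("eta", "0.1")}
def model = xgboost_regressor[features, responses, hyperparams]
```

The resulting model can later be used with `xgboost_regressor_predict[]`.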

#### Definition

`@inline def xgboost_regressor[F, R, H] = ext_ml_train[:xgboost_regressor, F, R, H]`

## xgboost_regressor_predict

```
xgboost_regressor_predict[M, F, N]
```

Given an XGBoost regression model trained with `xgboost_regressor[]`, make regression predictions on a test set.

For more information, see the documentation for `xgboost_regressor[]` and the XGBoost documentation.

Inputs:

- `M`: XGBoost regression model to use for prediction; must be the result of a previous `xgboost_regressor[]` call
- `F`: relation of test features for which regression predictions will be computed
- `N`: constant Int representing the number of keys in `F`

#### Definition

`@inline def xgboost_regressor_predict[M, F, N] = ext_ml_predict[:xgboost_regressor_predict, M, F, N]`