# The Machine Learning Library (ml)

Machine learning bindings for mlpack, GLM.jl, and XGBoost.

## glm_generic#

glm_generic[F, R, H]

A binding of the GLM.jl function glm. Fits a generalized linear model given features F, responses R, and a family and link specified in the hyperparameters H. The supported families and links are listed below.

Input options:

• F: Relation of features to perform a GLM regression on.
• R: Relation of responses to train the GLM regression model on.
• H: Relation of hyperparameters specifying the family and link to use when generating the generalized linear model. Example: H = {("family","Normal"); ("link","IdentityLink")}. Families supported: ["Bernoulli", "Binomial", "Gamma", "InverseGaussian", "NegativeBinomial", "Normal", "Poisson"]. Links supported: ["CauchitLink", "CloglogLink", "IdentityLink", "InverseLink", "InverseSquareLink", "LogitLink", "LogLink", "ProbitLink", "SqrtLink"].

Result:

• A GLM model that can later be used with glm_predict[].

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def hyperparams = {("family", "Binomial"); ("link", "LogitLink")}
def model = glm_generic[features, responses, hyperparams]

Definition

@inline def glm_generic[F, R, H] = ext_ml_train[:glm_generic, F, R, H]

## glm_linear_regression#

glm_linear_regression[F, R]

A binding of the GLM.jl function lm. Fits a linear regression model given features F and responses R.

Note that this is unregularized linear regression, so if your model does not converge (e.g. training gives a PosDefException), try using regularized linear regression, perhaps via mlpack_linear_regression[] with the lambda hyperparameter set, or ensure that the columns of your data are not linearly dependent.

Input options:

• F: Relation of features to perform linear regression on.
• R: Relation of responses to train the linear regression model on.

Result:

• A GLM model that can later be used with glm_predict.

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_linear_regression[features, responses]

Definition

@inline def glm_linear_regression[F, R] = ext_ml_train[:glm_generic, F, R,
{ a, b : (a = "family" and b = "Normal") or (a = "link" and b = "IdentityLink")}
]

## glm_logistic_regression#

glm_logistic_regression[F, R]

A binding of the GLM.jl function glm with the binomial family and Logit link. Fits a logistic regression model given features F and responses R.

Input options:

• F: Relation of features to perform logistic regression on.
• R: Relation of responses to train the logistic regression model on.

Result:

• A GLM model that can later be used with glm_predict.

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_logistic_regression[features, responses]

Definition

@inline def glm_logistic_regression[F, R] = ext_ml_train[:glm_generic, F, R,
{ a, b : (a = "family" and b = "Binomial") or (a = "link" and b = "LogitLink")}
]

## glm_predict#

glm_predict[M, F, N]

A binding of the GLM.jl function predict. Uses a generalized linear model M to generate predictions for features F. Here, M can be produced from any of the definitions glm_linear_regression, glm_logistic_regression, glm_probit_regression, or glm_generic.

Input options:

• M: Relation containing the model generated by running one of the generalized linear models previously (e.g. glm_linear_regression or glm_generic).
• F: Relation of features to generate the predictions given the previously computed model.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when computing predictions).

Result:

• Predictions of the features F after being fit with the model M.

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_probit_regression[features, responses]
def predictions = glm_predict[model, features, 1]

Definition

@inline def glm_predict[M, F, N] = ext_ml_predict[:glm_predict, M, F, N]

## glm_probit_regression#

glm_probit_regression[F, R]

A binding of the GLM.jl function glm with the binomial family and Probit link. Fits a probit regression model given features F and responses R.

Input options:

• F: Relation of features to perform probit regression on.
• R: Relation of responses to train the probit regression model on.

Result:

• A GLM model that can later be used with glm_predict.

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_probit_regression[features, responses]

Definition

@inline def glm_probit_regression[F, R] = ext_ml_train[:glm_generic, F, R,
{ a, b : (a = "family" and b = "Binomial") or (a = "link" and b = "ProbitLink")}
]

## mlpack_adaboost#

mlpack_adaboost[F, R, H]

An implementation of the AdaBoost.MH (Adaptive Boosting) algorithm for classification. This can be used to train an AdaBoost model on labeled data.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• iterations (Int): The maximum number of boosting iterations to be run (0 will run until convergence.) Default 1000.
• tolerance (Float64): The tolerance for change in values of the weighted error during training. Default 1e-10.
• verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.
• weak_learner (String): The type of weak learner to use: decision_stump, or perceptron. Default decision_stump.

Definition

@inline def mlpack_adaboost[F, R, H] = ext_ml_train[:mlpack_adaboost, F, R, H]

## mlpack_adaboost_predict#

mlpack_adaboost_predict[M, F, N]

Given an AdaBoost.MH model trained with mlpack_adaboost[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_adaboost[] for more details.

Inputs:

• M: AdaBoost model to use for prediction; must be the result of a previous mlpack_adaboost[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_adaboost_predict[M, F, N] =
ext_ml_predict[:mlpack_adaboost_predict, M, F, N]
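
Example (an illustrative sketch following the pattern of the GLM examples above; the data and hyperparameter values are arbitrary, not recommendations):

def features = {(1, 1.0, 2.0); (2, 2.0, 1.0); (3, 5.0, 6.0); (4, 6.0, 5.0)}
def responses = {(1, 0); (2, 0); (3, 1); (4, 1)}
def hyperparams = {("iterations", "100")}
def model = mlpack_adaboost[features, responses, hyperparams]
def predictions = mlpack_adaboost_predict[model, features, 1]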

## mlpack_approx_kfn#

mlpack_approx_kfn[K, M, Q, N, H]

Perform approximate k-furthest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_approx_kfn_build[].

Inputs:

• K: constant representing the number of furthest neighbors to search for.
• M: pre-trained approximate KFN model; must be the result of a previous mlpack_approx_kfn_build[] call.
• Q: relation of query points; must have the same number of keys as the relation that M was built with.
• N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• calculate_error (Bool): If set, calculate and display the average distance error for the first furthest neighbor only.
• verbose (Bool): Display informational messages.

Result:

• A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th approximate furthest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_approx_kfn[K, M, Q, N, H] =
ext_ml_transform[:mlpack_approx_kfn, K, M, Q, N, H]

## mlpack_approx_kfn_build#

mlpack_approx_kfn_build[R, N, H]

An implementation of two strategies for furthest neighbor search. This creates a furthest neighbor search model that can be reused later.

See also the mlpack documentation and the documentation for mlpack_approx_kfn[] for more details.

Inputs:

• R: relation of reference points that tree should be built on
• N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• algorithm (String): Algorithm to use: "ds" or "qdafn". Default "ds".
• num_projections (Int): Number of projections to use in each hash table. Default 5.
• num_tables (Int): Number of hash tables to use. Default 5.
• verbose (Bool): Display informational messages.

Result:

• An approximate KFN model that can be used in a later call to mlpack_approx_kfn[].

Definition

@inline def mlpack_approx_kfn_build[R, N, H] = ext_ml_build[:mlpack_approx_kfn, R, N, H]
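
Example of the build-then-search workflow (an illustrative sketch; the data points and hyperparameter values are arbitrary):

def points = {(1, 0.0, 0.0); (2, 1.0, 1.0); (3, 10.0, 10.0); (4, 11.0, 11.0)}
def model = mlpack_approx_kfn_build[points, 1, {("algorithm", "ds")}]
def furthest = mlpack_approx_kfn[2, model, points, 1, {("verbose", "false")}]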

## mlpack_dbscan#

mlpack_dbscan[F, N, H]

A clustering of the dataset F using DBSCAN clustering with parameters N and H.

See the mlpack documentation for more details.

Inputs:

• F: relation of data points to cluster.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• epsilon (Float64): Radius of each range search. Default 1.
• min_size (Int): Minimum number of points for a cluster. Default 5.
• naive (Bool): If set, brute-force range search (not tree-based) will be used. Default false.
• selection_type (String): If using point selection policy, the type of selection to use ("ordered", "random"). Default "ordered".
• single_mode (Bool): If set, single-tree range search (not dual-tree) will be used. Default false.
• tree_type (String): If using single-tree or dual-tree search, the type of tree to use ("kd", "r", "r-star", "x", "hilbert-r", "r-plus", "r-plus-plus", "cover", "ball"). Default "kd".
• verbose (Bool): Display informational messages.

Result:

• A relation containing the keys in F, with a cluster assignment (Int) as the last argument. If the point is considered “noise” (i.e. not part of any cluster), the cluster assignment is 0.

Definition

@inline def mlpack_dbscan[F, N, H] = ext_ml_transform[:mlpack_dbscan, 0, {()}, F, N, H]
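
Example (an illustrative sketch; the data and hyperparameter values are arbitrary — here three nearby points should form one cluster while the distant point is reported as noise, cluster 0):

def points = {(1, 1.0, 1.0); (2, 1.1, 0.9); (3, 1.2, 1.1); (4, 50.0, 50.0)}
def clusters = mlpack_dbscan[points, 1, {("epsilon", "1.0"); ("min_size", "2")}]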

## mlpack_decision_tree#

mlpack_decision_tree[F, R, H]

An implementation of an ID3-style decision tree for classification, which supports categorical data. This binding accepts categorical features in F; a feature in F is interpreted as categorical if it is an entity or has String type.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• maximum_depth (Int): Maximum depth of the tree (0 means no limit). Default 0.
• minimum_gain_split (Float64): Minimum gain for node splitting. Default 1e-7.
• minimum_leaf_size (Int): Minimum number of points in a leaf. Default 20.
• print_training_accuracy (Bool): Print the training accuracy. Default false.
• verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_decision_tree[F, R, H] = ext_ml_train[:mlpack_decision_tree, F, R, H]

## mlpack_decision_tree_predict#

mlpack_decision_tree_predict[M, F, N]

Given a decision tree model trained with mlpack_decision_tree[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_decision_tree[] for more details.

Inputs:

• M: decision tree model to use for prediction; must be the result of a previous mlpack_decision_tree[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_decision_tree_predict[M, F, N] =
ext_ml_predict[:mlpack_decision_tree_predict, M, F, N]
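
Example (an illustrative sketch; the data and hyperparameter values are arbitrary — the String-valued second argument is interpreted as a categorical feature, as described above):

def features = {(1, "red", 1.0); (2, "red", 2.0); (3, "blue", 5.0); (4, "blue", 6.0)}
def responses = {(1, 0); (2, 0); (3, 1); (4, 1)}
def model = mlpack_decision_tree[features, responses, {("minimum_leaf_size", "1")}]
def predictions = mlpack_decision_tree_predict[model, features, 1]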

## mlpack_det#

mlpack_det[M, F, N, H]

Given a DET trained with mlpack_det_build[], compute densities of the query points in the relation F.

See also the mlpack documentation and the documentation for mlpack_det_build[] for more details.

Inputs:

• M: pre-trained DET model; must be the result of a previous mlpack_det_build[] call.
• F: relation of features to compute density estimates for.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• verbose (Bool): Display informational messages. Default false.

Result:

• A relation mapping keys from F (i.e. the first N elements of the tuples in F) to their density estimates.

Definition

@inline def mlpack_det[M, F, N, H] = ext_ml_transform[:mlpack_det, 0, M, F, N, H]

## mlpack_det_build#

mlpack_det_build[F, N, H]

An implementation of density estimation trees (DETs) for the density estimation task. This binding trains a density estimation tree that can be reused in later calls to mlpack_det[].

See also the mlpack documentation and the documentation for mlpack_det[] for more details.

Inputs:

• F: relation of features to build density estimation tree on.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• folds (Int): The number of folds of cross-validation to perform for the estimation (0 is LOOCV). Default 10.
• max_leaf_size (Int): The maximum size of a leaf in the unpruned, fully grown DET. Default 10.
• min_leaf_size (Int): The minimum size of a leaf in the unpruned, fully grown DET. Default 5.
• skip_pruning (Bool): Whether to bypass the pruning process and output the unpruned tree only. Default false.
• verbose (Bool): Display informational messages. Default false.

Definition

@inline def mlpack_det_build[F, N, H] = ext_ml_build[:mlpack_det, F, N, H]
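
Example of building a DET and then computing densities with mlpack_det[] (an illustrative sketch; the data and hyperparameter values are arbitrary):

def points = {(1, 1.0); (2, 1.5); (3, 2.0); (4, 8.0); (5, 9.0)}
def model = mlpack_det_build[points, 1, {("folds", "10")}]
def densities = mlpack_det[model, points, 1, {("verbose", "false")}]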

## mlpack_emst#

mlpack_emst[F, N, H]

An implementation of the Dual-Tree Boruvka algorithm for computing the Euclidean minimum spanning tree of a set of input points.

Inputs:

• F: relation of data points to compute the minimum spanning tree of.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• leaf_size (Int): Leaf size in the kd-tree. One-element leaves give the empirically best performance, but at the cost of greater memory requirements. Default 1.
• naive (Bool): Compute the MST using O(n^2) naive algorithm. Default false.
• verbose (Bool): Display informational messages. Default false.

Result:

• An ordered edge relation with weights. Specifically, each point in F is associated with a set of N keys. The first argument of the output relation is the index of the edge (starting from 1); lower-weighted edges have lower indices. The following N arguments of the output relation correspond to the first vertex; the following N arguments of the output relation correspond to the second vertex; and the last argument represents the distance between those two vertices.

Definition

@inline def mlpack_emst[F, N, H] = ext_ml_transform[:mlpack_emst, 0, {()}, F, N, H]
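
Example (an illustrative sketch; the data and hyperparameter values are arbitrary). With N = 1, each output tuple has the form (edge_index, key1, key2, distance):

def points = {(1, 0.0, 0.0); (2, 1.0, 0.0); (3, 0.0, 1.0)}
def mst = mlpack_emst[points, 1, {("leaf_size", "1")}]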

## mlpack_fastmks#

mlpack_fastmks[K, M, Q, N, H]

Perform max-kernel search on a relation Q containing query points, using a model M that was built with mlpack_fastmks_build[].

See also the mlpack documentation and the documentation for mlpack_fastmks_build[] for more details.

Inputs:

• K: constant representing number of max kernels to search for.
• M: pre-trained FastMKS model; must be the result of a previous mlpack_fastmks_build[] call.
• Q: relation of query points; must have the same number of keys as the relation that M was built with.
• N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• verbose (Bool): Display informational messages.

Result:

• A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., kernel), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th max-kernel, and kernel is the kernel value between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_fastmks[K, M, Q, N, H] = ext_ml_transform[:mlpack_fastmks, K, M, Q, N, H]

## mlpack_fastmks_build#

mlpack_fastmks_build[R, N, H]

An implementation of max-kernel search using single-tree and dual-tree algorithms. Given a set of reference points and query points, this can build trees that can be used in later calls to mlpack_fastmks[].

See also the mlpack documentation and the documentation for mlpack_fastmks[] for more details.

Inputs:

• R: relation of reference points that tree should be built on
• N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• bandwidth (Float64): Bandwidth (for Gaussian, Epanechnikov, and triangular kernels). Default 1.
• base (Float64): Base to use during cover tree construction. Default 2.
• degree (Float64): Degree of polynomial kernel. Default 2.
• kernel (String): Kernel type to use: "linear", "polynomial", "cosine", "gaussian", "epanechnikov", "triangular", "hyptan". Default "linear".
• naive (Bool): If true, O(n^2) naive mode is used for computation. Default false.
• offset (Float64): Offset of kernel (for polynomial and hyptan kernels). Default 0.
• scale (Float64): Scale of kernel (for hyptan kernel). Default 1.
• single (Bool): If true, single-tree search is used (as opposed to dual-tree search). Default false.
• verbose (Bool): Display informational messages.

Result:

• A FastMKS model that can be used in a later call to mlpack_fastmks[].

Definition

@inline def mlpack_fastmks_build[R, N, H] = ext_ml_build[:mlpack_fastmks, R, N, H]
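
Example of the build-then-search workflow (an illustrative sketch; the data and hyperparameter values are arbitrary):

def refs = {(1, 1.0, 0.0); (2, 0.0, 1.0); (3, 1.0, 1.0)}
def model = mlpack_fastmks_build[refs, 1, {("kernel", "linear")}]
def result = mlpack_fastmks[2, model, refs, 1, {("verbose", "false")}]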

## mlpack_gmm_generate#

mlpack_gmm_generate[S, M, D, H]

A sample generator for pre-trained GMMs. Given a pre-trained GMM, this can sample new points randomly from that distribution.

Inputs:

• S: constant indicating the number of samples to generate.
• M: pre-trained GMM from mlpack_gmm_train[].
• D: constant representing the dimensionality of the model (i.e. the dimensionality of F in the call to mlpack_gmm_train[]).
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
• verbose (Bool): Display informational messages. Default false.

Result:

• A relation containing S samples from the given GMM M. The first argument is the key (an integer between 1 and S) and the rest of the arguments are each of the features.

Definition

@inline def mlpack_gmm_generate[S, M, D, H] =
ext_ml_transform[:mlpack_gmm_generate, S, M, {()}, D, H]

## mlpack_gmm_probability#

mlpack_gmm_probability[M, F, N, H]

A probability calculator for GMMs. Given a pre-trained GMM and a set of points, this can compute the probability that each point is from the given GMM.

Inputs:

• M: pre-trained GMM from mlpack_gmm_train[].
• F: relation of data points to compute the probabilities of.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• verbose (Bool): Display informational messages. Default false.

Result:

• A relation containing the keys of F (that is, the first N arguments) mapping to the probability that each of those samples arose from the GMM M.

Definition

@inline def mlpack_gmm_probability[M, F, N, H] =
ext_ml_transform[:mlpack_gmm_probability, 0, M, F, N, H]

## mlpack_gmm_train#

mlpack_gmm_train[F, N, H]

An implementation of the EM algorithm for training Gaussian mixture models (GMMs). Given a dataset, this can train a GMM for future use with other tools.

Inputs:

• F: relation of data points that the model should be built on
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when building the model).
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• diagonal_covariance (Bool): Force the covariance of the Gaussians to be diagonal. This can accelerate training time significantly. Default false.
• gaussians (Int): Number of Gaussians in the GMM. Required.
• kmeans_max_iterations (Int): Maximum number of iterations for the k-means algorithm (used to initialize EM). Default 1000.
• max_iterations (Int): Maximum number of iterations of EM algorithm (passing 0 will run until convergence). Default 250.
• no_force_positive (Bool): Do not force the covariance matrices to be positive definite. Default false.
• noise (Float64): Variance of zero-mean Gaussian noise to add to data. Default 0.
• percentage (Float64): If using refined_start, specify the percentage of the dataset used for each sampling (should be between 0.0 and 1.0). Default 0.02.
• refined_start (Bool): During the initialization, use refined initial positions for k-means clustering (Bradley and Fayyad, 1998). Default false.
• samplings (Int): If using refined_start, specify the number of samplings used for initial points. Default 100.
• seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
• tolerance (Float64): Tolerance for convergence of EM. Default 1e-10.
• trials (Int): Number of trials to perform in training GMM. Default 1.
• verbose (Bool): Display informational messages. Default false.

Definition

@inline def mlpack_gmm_train[F, N, H] = ext_ml_build[:mlpack_gmm_train, F, N, H]
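
Example tying together the three GMM bindings (an illustrative sketch; the data and hyperparameter values are arbitrary). Here the features have one key and two value dimensions, so D = 2 in the mlpack_gmm_generate[] call:

def points = {(1, 1.0, 1.0); (2, 1.2, 0.8); (3, 9.0, 9.0); (4, 8.8, 9.2)}
def model = mlpack_gmm_train[points, 1, {("gaussians", "2")}]
def probs = mlpack_gmm_probability[model, points, 1, {("verbose", "false")}]
def samples = mlpack_gmm_generate[5, model, 2, {("seed", "1")}]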

## mlpack_hoeffding_tree#

mlpack_hoeffding_tree[F, R, H]

An implementation of Hoeffding trees, a form of streaming decision tree for classification. Given labeled data, a Hoeffding tree can be trained. This binding accepts categorical features in F; a feature in F is interpreted as categorical if it is an entity or has String type.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• batch_mode (Bool): If true, samples will be considered in batch instead of as a stream. This generally results in better trees but at the cost of memory usage and runtime.
• bins (Int): If the domingos split strategy is used, this specifies the number of bins for each numeric split. Default 10.
• confidence (Float64): Confidence before splitting (between 0 and 1). Default 0.95.
• info_gain (Bool): If set, information gain is used instead of Gini impurity for calculating Hoeffding bounds.
• max_samples (Int): Maximum number of samples before splitting. Default 5000.
• min_samples (Int): Minimum number of samples before splitting. Default 100.
• numeric_split_strategy (String): The splitting strategy to use for numeric features: domingos or binary. Default binary.
• observations_before_binning (Int): If the domingos split strategy is used, this specifies the number of samples observed before binning is performed.
• passes (Int): Number of passes to take over the dataset. Default 1.
• verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_hoeffding_tree[F, R, H] = ext_ml_train[:mlpack_hoeffding_tree, F, R, H]

## mlpack_hoeffding_tree_predict#

mlpack_hoeffding_tree_predict[M, F, N]

Given a Hoeffding tree model trained with mlpack_hoeffding_tree[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_hoeffding_tree[] for more details.

Inputs:

• M: Hoeffding tree model to use for prediction; must be the result of a previous mlpack_hoeffding_tree[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_hoeffding_tree_predict[M, F, N] =
ext_ml_predict[:mlpack_hoeffding_tree_predict, M, F, N]
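
Example (an illustrative sketch; the data and hyperparameter values are arbitrary):

def features = {(1, 1.0); (2, 2.0); (3, 5.0); (4, 6.0)}
def responses = {(1, 0); (2, 0); (3, 1); (4, 1)}
def model = mlpack_hoeffding_tree[features, responses, {("passes", "2")}]
def predictions = mlpack_hoeffding_tree_predict[model, features, 1]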

## mlpack_kernel_pca#

mlpack_kernel_pca[D, F, N, H]

An implementation of Kernel Principal Components Analysis (KPCA). This can be used to perform nonlinear dimensionality reduction or preprocessing on a given dataset.

Input options:

• D: constant indicating the desired new dimensionality of the data.
• F: relation of features to perform kernel PCA on.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• bandwidth (Float64): Bandwidth, for "gaussian" and "laplacian" kernels. Default 1.
• center (Bool): If set, the transformed data will be centered about the origin. Default false.
• degree (Float64): Degree of polynomial, for the "polynomial" kernel. Default 1.
• kernel (String): The kernel to use; see the linked documentation for the list of usable kernels. Default "gaussian".
• kernel_scale (Float64): Scale, for "hyptan" kernel. Default 1.
• nystroem_method (Bool): If set, the Nystroem method will be used. Default false.
• offset (Float64): Offset, for "hyptan" and "polynomial" kernels. Default 0.
• sampling (String): Sampling scheme to use for the Nystroem method: "kmeans", "random", "ordered". Default "kmeans".
• verbose (Bool): Display informational messages. Default false.

Result:

• A relation mapping keys in F (i.e. the first N arguments of F) to D values in each dimension.

Definition

@inline def mlpack_kernel_pca[D, F, N, H] =
ext_ml_transform[:mlpack_kernel_pca, D, {()}, F, N, H]
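
Example reducing two value dimensions down to one (an illustrative sketch; the data and hyperparameter values are arbitrary):

def points = {(1, 1.0, 2.0); (2, 2.0, 4.0); (3, 3.0, 6.0)}
def reduced = mlpack_kernel_pca[1, points, 1, {("kernel", "gaussian"); ("bandwidth", "1.0")}]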

## mlpack_kfn#

mlpack_kfn[K, M, Q, N, H]

Perform k-furthest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_kfn_build[].

See also the mlpack documentation and the documentation for mlpack_kfn_build[] for more details.

Inputs:

• K: constant representing the number of furthest neighbors to search for.
• M: pre-trained KFN model; must be the result of a previous mlpack_kfn_build[] call.
• Q: relation of query points; must have the same number of keys as the relation that M was built with.
• N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• algorithm (String): Type of neighbor search: "naive", "single_tree", "dual_tree", "greedy". Default "dual_tree".
• epsilon (Float64): If specified, will do approximate nearest neighbor search with given relative error. Default 0.
• percentage (Float64): If specified, will do approximate furthest neighbor search. Must be in the range (0,1] (decimal form). Resultant neighbors will be at least (p*100)% of the distance to the true furthest neighbor. Default 1.
• verbose (Bool): Display informational messages.

Result:

• A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th furthest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_kfn[K, M, Q, N, H] = ext_ml_transform[:mlpack_kfn, K, M, Q, N, H]

## mlpack_kfn_build#

mlpack_kfn_build[R, N, H]

An implementation of k-furthest-neighbor search using single-tree and dual-tree algorithms. This can build a tree that can be saved for future use.

See also the mlpack documentation and the documentation for mlpack_kfn[] for more details.

Inputs:

• R: relation of reference points that tree should be built on
• N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• leaf_size (Int): Leaf size for tree building (used for kd-trees, vp trees, random projection trees, UB trees, R trees, R* trees, X trees, Hilbert R trees, R+ trees, R++ trees, and octrees). Default 20.
• random_basis (Bool): Before tree-building, project the data onto a random orthogonal basis. Default false.
• seed (Int): Random seed (if 0, std::time(NULL) is used). Default 0.
• tree_type (String): Type of tree to use: "kd", "vp", "rp", "max-rp", "ub", "cover", "r", "r-star", "x", "ball", "hilbert-r", "r-plus", "r-plus-plus", "oct". Default "kd".
• verbose (Bool): Display informational messages.

Result:

• A KFN model that can be used with a later call to mlpack_kfn[].

Definition

@inline def mlpack_kfn_build[R, N, H] = ext_ml_build[:mlpack_kfn, R, N, H]
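
Example of the build-then-search workflow (an illustrative sketch; the data and hyperparameter values are arbitrary):

def refs = {(1, 0.0, 0.0); (2, 5.0, 5.0); (3, 10.0, 0.0)}
def model = mlpack_kfn_build[refs, 1, {("tree_type", "kd")}]
def furthest = mlpack_kfn[1, model, refs, 1, {("algorithm", "dual_tree")}]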

## mlpack_kmeans#

mlpack_kmeans[K, F, N, H]

An implementation of several strategies for efficient k-means clustering. Given a dataset and a value of k, this computes a k-means clustering on that data.

Inputs:

• K: constant indicating the number of clusters for k-means clustering.
• F: relation of data points to cluster.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• algorithm (String): Algorithm to use for the Lloyd iteration ("naive", "pelleg-moore", "elkan", "hamerly", "dualtree", or "dualtree-covertree"). Default "naive".
• allow_empty_clusters (Bool): Allow empty clusters to persist. Default false.
• kill_empty_clusters (Bool): Remove empty clusters when they occur. Default false.
• max_iterations (Int): Maximum number of iterations before k-means terminates. Default 1000.
• percentage (Float64): Percentage of dataset to use for each refined start sampling (use when refined_start is specified). Default 0.02.
• refined_start (Bool): Use the refined initial point strategy by Bradley and Fayyad to choose initial points. Default false.
• samplings (Int): Number of samplings to perform for refined start (use when refined_start is specified). Default 100.
• seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
• verbose (Bool): Display information messages. Default false.

Result:

• A relation containing the keys in F with a cluster assignment (Int) between 1 and K as the last argument.
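
Example:

A minimal sketch, clustering four two-dimensional points (keyed by one Int argument, so N is 1) into two clusters; the data values are purely illustrative:

```rel
def points = {(1, 1.0, 1.1); (2, 0.9, 1.0); (3, 5.0, 5.1); (4, 5.2, 4.9)}
def hyperparams = {("max_iterations", "500")}
def assignments = mlpack_kmeans[2, points, 1, hyperparams]
```

Each tuple of assignments pairs a key from points with a cluster index of 1 or 2.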

Definition

@inline def mlpack_kmeans[K, F, N, H] = ext_ml_transform[:mlpack_kmeans, K, {()}, F, N, H]

## mlpack_kmeans_centroids#

mlpack_kmeans_centroids[K, F, N, H]

An implementation of several strategies for efficient k-means clustering. Given a dataset and a value of k, this computes centroids for a k-means clustering on that data.

Inputs:

• K: constant indicating the number of clusters for k-means clustering.
• F: relation of data points to cluster.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• algorithm (String): Algorithm to use for the Lloyd iteration ("naive", "pelleg-moore", "elkan", "hamerly", "dualtree", or "dualtree-covertree"). Default "naive".
• allow_empty_clusters (Bool): Allow empty clusters to persist. Default false.
• kill_empty_clusters (Bool): Remove empty clusters when they occur. Default false.
• max_iterations (Int): Maximum number of iterations before k-means terminates. Default 1000.
• percentage (Float64): Percentage of dataset to use for each refined start sampling (use when refined_start is specified). Default 0.02.
• refined_start (Bool): Use the refined initial point strategy by Bradley and Fayyad to choose initial points. Default false.
• samplings (Int): Number of samplings to perform for refined start (use when refined_start is specified). Default 100.
• seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
• verbose (Bool): Display information messages. Default false.

Result:

• A relation containing a cluster index between 1 and K that maps to the centroid of each dimension in F. So, the first argument of this relation is the cluster index, and the rest correspond to the arguments of F that are after the first N key arguments.

Definition

@inline def mlpack_kmeans_centroids[K, F, N, H] =
ext_ml_transform[:mlpack_kmeans_centroids, K, {()}, F, N, H]

## mlpack_knn#

mlpack_knn[K, M, Q, N, H]

Perform k-nearest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_knn_build[].

See also the mlpack documentation and the documentation for mlpack_knn_build[] for more details.

Inputs:

• K: constant representing number of nearest neighbors to search for.
• M: pre-trained model for kNN; must be the result of a previous mlpack_knn_build[] call.
• Q: relation of query points; must have the same number of keys as the relation that M was built with.
• N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• algorithm (String): Type of neighbor search: "naive", "single_tree", "dual_tree", "greedy". Default "dual_tree".
• epsilon (Float64): If specified, will do approximate nearest neighbor search with given relative error. Default 0.
• verbose (Bool): Display informational messages.

Result:

• A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th nearest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....
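
Example:

A minimal sketch that builds a model on three reference points and then finds the two nearest neighbors of a single query point (data values are illustrative):

```rel
def refs = {(1, 0.0, 0.0); (2, 1.0, 1.0); (3, 5.0, 5.0)}
def model = mlpack_knn_build[refs, 1, {("tree_type", "kd")}]
def queries = {(1, 0.1, 0.1)}
def neighbors = mlpack_knn[2, model, queries, 1, {("algorithm", "dual_tree")}]
```

Here neighbors contains tuples of the form (query_key, k, reference_key, distance).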

Definition

@inline def mlpack_knn[K, M, Q, N, H] = ext_ml_transform[:mlpack_knn, K, M, Q, N, H]

## mlpack_knn_build#

mlpack_knn_build[R, N, H]

An implementation of k-nearest-neighbor search using single-tree and dual-tree algorithms. Given a set of reference points and query points, this can build trees that can be used in later calls to mlpack_knn[].

See also the mlpack documentation and the documentation for mlpack_knn[] for more details.

Inputs:

• R: relation of reference points that tree should be built on
• N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• leaf_size (Int): Leaf size for tree building (used for kd-trees, vp trees, random projection trees, UB trees, R trees, R* trees, X trees, Hilbert R trees, R+ trees, R++ trees, spill trees, and octrees). Default 20.
• random_basis (Bool): before tree-building, project the data onto a random orthogonal basis. Default false.
• rho (Float64): Balance threshold (only valid for spill trees). Default 0.7.
• tau (Float64): Overlapping size (only valid for spill trees). Default 0.
• tree_type (String): Type of tree to use: "kd", "vp", "rp", "max-rp", "ub", "cover", "r", "r-star", "x", "ball", "hilbert-r", "r-plus", "r-plus-plus", "spill", "oct". Default "kd".
• verbose (Bool): Display informational messages.

Result:

• A KNN model that can be used in a later call to mlpack_knn[].

Definition

@inline def mlpack_knn_build[R, N, H] = ext_ml_build[:mlpack_knn, R, N, H]

## mlpack_krann#

mlpack_krann[K, M, Q, N, H]

Perform k-rank-approximate-nearest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_krann_build[].

See also the mlpack documentation and the documentation for mlpack_krann_build[] for more details.

Inputs:

• K: constant representing number of nearest neighbors to search for.
• M: pre-trained model for kRANN; must be the result of a previous mlpack_krann_build[] call.
• Q: relation of query points; must have the same number of keys as the relation that M was built with.
• N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• alpha (Float64): The desired success probability. Default 0.95.
• tau (Float64): The allowed rank-error in terms of the percentile of the data. Default 5.
• verbose (Bool): Display informational messages.

Result:

• A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th rank-approximate nearest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_krann[K, M, Q, N, H] = ext_ml_transform[:mlpack_krann, K, M, Q, N, H]

## mlpack_krann_build#

mlpack_krann_build[R, N, H]

An implementation of rank-approximate k-nearest-neighbor search (kRANN) using single-tree and dual-tree algorithms. Given a set of reference points and query points, this can build trees that can be used in later calls to mlpack_krann[].

See also the mlpack documentation and the documentation for mlpack_krann[] for more details.

Inputs:

• R: relation of reference points that tree should be built on
• N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• first_leaf_exact (Bool): The flag to trigger sampling only after exactly exploring the first leaf. Default false.
• leaf_size (Int): Leaf size for tree building (used for kd-trees, UB trees, R trees, R* trees, X trees, Hilbert R trees, R+ trees, R++ trees, and octrees). Default 20.
• naive (Bool): If true, sampling will be done without using a tree. Default false.
• random_basis (Bool): Before tree-building, project the data onto a random orthogonal basis. Default false.
• sample_at_leaves (Bool): The flag to trigger sampling at leaves. Default false.
• seed (Int): Random seed (if 0, std::time(NULL) is used). Default 0.
• single_mode (Bool): If true, single-tree search is used (as opposed to dual-tree search). Default false.
• single_sample_limit (Int): The limit on the maximum number of samples (and hence the largest node you can approximate). Default 20.
• tree_type (String): Type of tree to use: "kd", "ub", "cover", "r", "x", "r-star", "hilbert-r", "r-plus", "r-plus-plus", "oct". Default "kd".
• verbose (Bool): Display informational messages.

Result:

• A rank-approximate KNN model that can be used in a later call to mlpack_krann[].

Definition

@inline def mlpack_krann_build[R, N, H] = ext_ml_build[:mlpack_krann, R, N, H]

## mlpack_lars#

mlpack_lars[F, R, H]

An implementation of Least Angle Regression (LARS), a stagewise homotopy-based algorithm. This can train a LARS, LASSO, or Elastic Net model.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• lambda1 (Float64): Regularization parameter for l1-norm penalty. Default 0.
• lambda2 (Float64): Regularization parameter for l2-norm penalty. Default 0.
• use_cholesky (Bool): Use Cholesky decomposition during computation rather than explicitly computing the full Gram matrix. Default false.
• verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.
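
Example:

A minimal LASSO-style fit using an l1-norm penalty (data values are illustrative):

```rel
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 2.1); (2, 4.0); (3, 6.2); (4, 7.9); (5, 10.1)}
def model = mlpack_lars[features, responses, {("lambda1", "0.1")}]
```

The resulting model can be passed to mlpack_lars_predict[].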

Definition

@inline def mlpack_lars[F, R, H] = ext_ml_train[:mlpack_lars, F, R, H]

## mlpack_lars_predict#

mlpack_lars_predict[M, F, N]

Given a LARS model trained with mlpack_lars[], make predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_lars[] for more details.

Inputs:

• M: LARS model to use for prediction; must be the result of a previous mlpack_lars[] call
• F: relation of test features for which predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_lars_predict[M, F, N] = ext_ml_predict[:mlpack_lars_predict, M, F, N]

## mlpack_linear_regression#

mlpack_linear_regression[F, R, H]

An implementation of simple linear regression and ridge regression using ordinary least squares. Given a dataset and responses, a model can be trained.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• lambda (Float64): Tikhonov regularization for ridge regression. If 0, the method reduces to linear regression. Default 0.
• verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.
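
Example:

Fitting a ridge regression model to five one-dimensional points (data values are illustrative):

```rel
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 2.0); (2, 4.1); (3, 5.9); (4, 8.0); (5, 10.1)}
def model = mlpack_linear_regression[features, responses, {("lambda", "0.1")}]
```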

Definition

@inline def mlpack_linear_regression[F, R, H] =
ext_ml_train[:mlpack_linear_regression, F, R, H]

## mlpack_linear_regression_predict#

mlpack_linear_regression_predict[M, F, N]

Given a linear regression model trained with mlpack_linear_regression[], make predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_linear_regression[] for more details.

Inputs:

• M: linear regression model to use for prediction; must be the result of a previous mlpack_linear_regression[] call
• F: relation of test features for which predictions will be computed
• N: constant Int representing the number of keys in F
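
Example:

Training a model and then predicting responses for two new points (data values are illustrative):

```rel
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 2.0); (2, 4.1); (3, 5.9); (4, 8.0); (5, 10.1)}
def model = mlpack_linear_regression[features, responses, {("lambda", "0.0")}]
def test = {(1, 6.0); (2, 7.0)}
def predictions = mlpack_linear_regression_predict[model, test, 1]
```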

Definition

@inline def mlpack_linear_regression_predict[M, F, N] =
ext_ml_predict[:mlpack_linear_regression_predict, M, F, N]

## mlpack_linear_svm#

mlpack_linear_svm[F, R, H]

An implementation of linear SVM for multiclass classification. Given labeled data, a model can be trained and saved for future use.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• delta (Float64): margin of difference between correct class and other classes (default 1.0).
• epochs (Int): maximum number of full epochs over dataset for psgd (default 50).
• lambda (Float64): L2-regularization parameter for training (default 0.0001).
• max_iterations (Int): Maximum iterations for optimizer (0 indicates no limit). Default 10000.
• no_intercept (Bool): Do not add the intercept term to the model (default false).
• num_classes (Int): Number of classes for classification; if unspecified (or 0), the number of classes found in the labels will be used. Default 0.
• optimizer (String): Optimizer to use for training ("lbfgs" or "psgd"). Default "lbfgs".
• seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
• shuffle (Bool): If true, the order in which data points are visited for parallel SGD is not shuffled. Default false.
• step_size (Float64): Step size for parallel SGD optimizer. Default 0.01.
• tolerance (Float64): Convergence tolerance for optimizer. Default 1e-10.
• verbose (Bool): Display informational messages. Default false.

Definition

@inline def mlpack_linear_svm[F, R, H] = ext_ml_train[:mlpack_linear_svm, F, R, H]

## mlpack_linear_svm_predict#

mlpack_linear_svm_predict[M, F, N]

Given a linear SVM model trained with mlpack_linear_svm[], make predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_linear_svm[] for more details.

Inputs:

• M: linear SVM model to use for prediction; must be the result of a previous mlpack_linear_svm[] call
• F: relation of test features for which predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_linear_svm_predict[M, F, N] =
ext_ml_predict[:mlpack_linear_svm_predict, M, F, N]

## mlpack_logistic_regression#

mlpack_logistic_regression[F, R, H]

An implementation of L2-regularized logistic regression for two-class classification. Given labeled data, a model can be trained and saved for future use.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• batch_size (Int): Batch size for SGD. Default 64.
• decision_boundary (Float64): Decision boundary for prediction; if the logistic function for a point is less than the boundary, the class is taken to be 1; otherwise, the class is 2. Default 0.5.
• lambda (Float64): L2-regularization parameter for training. Default 0.
• max_iterations (Int): Maximum iterations for optimizer (0 indicates no limit). Default 10000.
• optimizer (String): Optimizer to use for training ("lbfgs" or "sgd"). Default "lbfgs".
• step_size (Float64): Step size for SGD optimizer. Default 0.01.
• tolerance (Float64): Convergence tolerance for optimizer. Default 1e-10.
• verbose (Bool): Display informational messages. Default false.
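
Example:

A minimal two-class sketch, with classes encoded as 1 and 2 (data values are illustrative):

```rel
def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def labels = {(1, 1); (2, 1); (3, 1); (4, 2); (5, 2)}
def model = mlpack_logistic_regression[features, labels, {("lambda", "0.01")}]
```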

Definition

@inline def mlpack_logistic_regression[F, R, H] =
ext_ml_train[:mlpack_logistic_regression, F, R, H]

## mlpack_logistic_regression_predict#

mlpack_logistic_regression_predict[M, F, N]

Given a logistic regression model trained with mlpack_logistic_regression[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_logistic_regression[] for more details.

Inputs:

• M: logistic regression model to use for class predictions; must be the result of a previous mlpack_logistic_regression[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_logistic_regression_predict[M, F, N] =
ext_ml_predict[:mlpack_logistic_regression_predict, M, F, N]

## mlpack_lsh#

mlpack_lsh[K, M, Q, N, H]

Perform approximate k-nearest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_lsh_build[].

See also the mlpack documentation and the documentation for mlpack_lsh_build[] for more details.

Inputs:

• K: constant representing number of nearest neighbors to search for.
• M: pre-trained LSH model; must be the result of a previous mlpack_lsh_build[] call.
• Q: relation of query points; must have the same number of keys as the relation that M was built with.
• N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• num_probes (Int): Number of additional probes for multiprobe LSH; if 0, traditional LSH is used. Default 0.
• verbose (Bool): Display informational messages.

Result:

• A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th nearest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_lsh[K, M, Q, N, H] = ext_ml_transform[:mlpack_lsh, K, M, Q, N, H]

## mlpack_lsh_build#

mlpack_lsh_build[R, N, H]

An implementation of approximate k-nearest-neighbor search with locality-sensitive hashing (LSH). Given a set of reference points, this will build an LSH model.

See also the mlpack documentation and the documentation for mlpack_lsh[] for more details.

Inputs:

• R: relation of reference points that tree should be built on
• N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• bucket_size (Int): The size of a bucket in the second level hash. Default 500.
• hash_width (Float64): The hash width for the first-level hashing in the LSH preprocessing. By default, the LSH class automatically estimates a hash width for its use.
• projections (Int): The number of hash functions for each table. Default 10.
• second_hash_size (Int): The size of the second level hash table. Default 99901.
• seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
• tables (Int): The number of hash tables to be used. Default 30.
• verbose (Bool): Display informational messages.

Result:

• An LSH model that can be used in a later call to mlpack_lsh[].
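
Example:

Building an LSH model over three reference points and querying it for two approximate nearest neighbors (data values are illustrative):

```rel
def refs = {(1, 0.0, 0.0); (2, 1.0, 1.0); (3, 5.0, 5.0)}
def model = mlpack_lsh_build[refs, 1, {("tables", "10")}]
def queries = {(1, 0.1, 0.1)}
def neighbors = mlpack_lsh[2, model, queries, 1, {("num_probes", "5")}]
```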

Definition

@inline def mlpack_lsh_build[R, N, H] = ext_ml_build[:mlpack_lsh, R, N, H]

## mlpack_mean_shift#

mlpack_mean_shift[F, N, H]

A clustering of the data using the mean shift algorithm. Uses a fast implementation of mean-shift clustering using dual-tree range search.

See the mlpack documentation for more details.

Inputs:

• F: relation of data points to cluster.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• force_convergence (Bool): If specified, the mean shift algorithm will continue running regardless of max_iterations until the clusters converge. Default false.
• max_iterations (Int): Maximum number of iterations before mean shift terminates. Default 1000.
• radius (Float64): If the distance between two centroids is less than the given radius, one will be removed. A radius of 0 or less means an estimate will be calculated and used for the radius. Default 0.
• verbose (Bool): Display informational messages. Default false.

Result:

• A relation containing the keys in F with a cluster assignment (Int) as the last element. If the key was not assigned to a cluster, the cluster assignment will be 0.
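
Example:

Clustering four one-dimensional points, letting the algorithm estimate the radius (data values are illustrative):

```rel
def points = {(1, 1.0); (2, 1.1); (3, 5.0); (4, 5.1)}
def assignments = mlpack_mean_shift[points, 1, {("max_iterations", "500")}]
```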

Definition

@inline def mlpack_mean_shift[F, N, H] =
ext_ml_transform[:mlpack_mean_shift, 0, {()}, F, N, H]

## mlpack_nbc#

mlpack_nbc[F, R, H]

An implementation of the Naive Bayes Classifier, used for classification. Given labeled data, an NBC model can be trained.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• incremental_variance (Bool): The variance of each class will be calculated incrementally.
• verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_nbc[F, R, H] = ext_ml_train[:mlpack_nbc, F, R, H]

## mlpack_nbc_predict#

mlpack_nbc_predict[M, F, N]

Given a Naive Bayes classifier model trained with mlpack_nbc[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_nbc[] for more details.

Inputs:

• M: Naive Bayes classification model to use for prediction; must be the result of a previous mlpack_nbc[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_nbc_predict[M, F, N] = ext_ml_predict[:mlpack_nbc_predict, M, F, N]

## mlpack_nmf#

mlpack_nmf[R, F, N, H]

An implementation of non-negative matrix factorization. This can be used to decompose an input dataset into two low-rank non-negative components.

Inputs:

• R: constant indicating the rank of the low-rank decomposition.
• F: relation of features to decompose into two low-rank matrices.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• max_iterations (Int): Number of iterations before NMF terminates (0 runs until convergence). Default 10000.
• min_residue (Float64): The minimum root mean square residue allowed for each iteration, below which the program terminates. Default 1e-05.
• seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
• update_rules (String): Update rules for each iteration; ( "multdist" | "multdiv" | "als" ). Default "multdist".
• verbose (Bool): Display informational messages. Default false.

Result:

• A relation encoding both the low-rank H and W matrices. However, it is slightly confusing because rows in W are keyed by the first N arguments of F, but rows in H are keyed by integers. Thus, the first argument of the output is either 1 if the tuple corresponds to a row of W or 2 if the tuple corresponds to a row of H. Then, if the first argument is 1, then the next N arguments are keys from F; otherwise they are zero values that should be ignored. After that, if the first argument is 2, the next argument is the (Int) row index for tuples pertaining to H; otherwise they are zero values that should be ignored. The following argument is the (Int) index of the argument that the tuple pertains to in W or H. The last argument is the (Float64) value in either W or H referenced by the previous arguments.

In some sense, the format of the result can be understood as an “interleaved sparse representation” of W and H. We are forced to do this in part because Rel cannot currently return two relations easily from one call.
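
The interleaved representation can be pulled apart into separate W and H relations. A sketch, assuming N = 1 and a rank-2 decomposition (data values are illustrative):

```rel
def data = {(1, 1.0, 2.0); (2, 2.0, 4.0); (3, 3.0, 6.0)}
def out = mlpack_nmf[2, data, 1, {("max_iterations", "100")}]

// Tuples whose first argument is 1 belong to W; the zero placeholder
// standing in for H's row index is dropped.
def w(i, j, v) = out(1, i, 0, j, v)

// Tuples whose first argument is 2 belong to H; the zero placeholder
// standing in for F's key is dropped.
def h(r, j, v) = out(2, 0, r, j, v)
```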

Definition

@inline def mlpack_nmf[R, F, N, H] = ext_ml_transform[:mlpack_nmf, R, {()}, F, N, H]

## mlpack_pca#

mlpack_pca[D, F, N, H]

An implementation of several strategies for principal components analysis (PCA), a common preprocessing step. Given a dataset and a desired new dimensionality, this can reduce the dimensionality of the data using the linear transformation determined by PCA.

Input options:

• D: constant indicating the desired new dimensionality of the data.
• F: relation of features to perform PCA on.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• decomposition_method (String): Method used for the principal components analysis: "exact", "randomized", "randomized-block-krylov", "quic". Default "exact".
• scale (Bool): If set, the data will be scaled before running PCA, such that the variance of each feature is 1. Default false.
• verbose (Bool): Display informational messages. Default false.

Result:

• A relation mapping keys in F (i.e. the first N arguments of F) to D values in each dimension.
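
Example:

Reducing three-dimensional points to two dimensions, scaling each feature to unit variance first (data values are illustrative):

```rel
def features = {(1, 1.0, 2.0, 3.0); (2, 2.0, 4.1, 6.0); (3, 3.0, 5.9, 9.1)}
def reduced = mlpack_pca[2, features, 1, {("scale", "true")}]
```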

Definition

@inline def mlpack_pca[D, F, N, H] = ext_ml_transform[:mlpack_pca, D, {()}, F, N, H]

## mlpack_perceptron#

mlpack_perceptron[F, R, H]

An implementation of a perceptron (a single-layer neural network) for classification. Given labeled data, a perceptron can be trained.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• max_iterations (Int): The maximum number of iterations the perceptron is to be run. Default 1000.
• verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_perceptron[F, R, H] = ext_ml_train[:mlpack_perceptron, F, R, H]

## mlpack_perceptron_predict#

mlpack_perceptron_predict[M, F, N]

Given a perceptron model trained with mlpack_perceptron[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_perceptron[] for more details.

Inputs:

• M: Perceptron model to use for prediction; must be the result of a previous mlpack_perceptron[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_perceptron_predict[M, F, N] =
ext_ml_predict[:mlpack_perceptron_predict, M, F, N]

## mlpack_preprocess_split#

mlpack_preprocess_split[F, H]

This utility takes a dataset and splits it into a training set and a test set. Before the split, the points in the dataset are randomly reordered. The percentage of the dataset to be used as the test set can be specified with the test_ratio parameter; the default is 0.2 (20%).

Input options:

• F: relation of features to split. If you want to split labels too, they should be included in this relation.
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• no_shuffle (Bool): Avoid shuffling the data before splitting. Default false.
• seed (Int): Random seed (0 for std::time(NULL)). Default 0.
• test_ratio (Float64): Ratio of test set; if not set, the ratio defaults to 0.2.
• verbose (Bool): Display informational messages. Default false.

Result:

• A relation F with membership in the training or test set prepended. So, if (t...) was a tuple in F, (set, t...) will be included where set is 1 if the point t... is a part of the training set, and set is 2 if the point is a part of the test set.
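
Example:

Splitting a small labeled dataset 60/40 and then separating the two parts by the prepended set index (data values are illustrative):

```rel
def data = {(1, 1.0, 0); (2, 2.0, 0); (3, 3.0, 1); (4, 4.0, 1); (5, 5.0, 1)}
def split = mlpack_preprocess_split[data, {("test_ratio", "0.4")}]

// Set index 1 marks the training set, 2 the test set.
def train_set(k, v, l) = split(1, k, v, l)
def test_set(k, v, l) = split(2, k, v, l)
```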

Definition

@inline def mlpack_preprocess_split[F, H] =
ext_ml_transform[:mlpack_preprocess_split, 0, {()}, F, 0, H]

## mlpack_radical#

mlpack_radical[F, N, H]

An implementation of RADICAL, a method for independent component analysis (ICA). Given a dataset, this can decompose the dataset into an independent component matrix; this can be useful for preprocessing.

Input options:

• F: relation of features to perform RADICAL on.
• N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
• H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• angles (Int): Number of angles to consider in brute-force search during Radical2D. Default 150.
• noise_std_dev (Float64): Standard deviation of Gaussian noise. Default 0.175.
• objective (Bool): If set, an estimate of the final objective function is printed. Default false.
• replicates (Int): Number of Gaussian-perturbed replicates to use (per point) in Radical2D. Default 30.
• sweeps (Int): Number of sweeps; each sweep calls Radical2D once for each pair of dimensions. Default 0.
• verbose (Bool): Display informational messages. Default false.

Result:

• A relation mapping keys in F (i.e. the first N arguments of F) to independent component values in each dimension.

Definition

@inline def mlpack_radical[F, N, H] = ext_ml_transform[:mlpack_radical, 0, {()}, F, N, H]

## mlpack_random_forest#

mlpack_random_forest[F, R, H]

An implementation of the standard random forest algorithm by Leo Breiman for classification. Given labeled data, a random forest can be trained.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• maximum_depth (Int): Maximum depth of the tree (0 means no limit). Default 0.
• minimum_gain_split (Float64): Minimum gain needed to make a split when building a tree. Default 0.0.
• minimum_leaf_size (Int): Minimum number of points in each leaf node. Default 1.
• num_trees (Int): Number of trees in the random forest. Default 10.
• print_training_accuracy (Bool): If set, then the accuracy of the model on the training set will be predicted (verbose must also be specified).
• seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
• subspace_dim (Int): Dimensionality of random subspace to use for each split. 0 will autoselect the square root of data dimensionality. Default 0.
• verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.
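
Example:

Training a forest of 20 trees on a toy two-class dataset (data values are illustrative):

```rel
def features = {(1, 1.0, 2.0); (2, 2.0, 1.0); (3, 5.0, 6.0); (4, 6.0, 5.0)}
def labels = {(1, 1); (2, 1); (3, 2); (4, 2)}
def model = mlpack_random_forest[features, labels, {("num_trees", "20")}]
```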

Definition

@inline def mlpack_random_forest[F, R, H] =
ext_ml_train[:mlpack_random_forest, F, R, H]

## mlpack_random_forest_predict#

mlpack_random_forest_predict[M, F, N]

Given a random forest model trained with mlpack_random_forest[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_random_forest[] for more details.

Inputs:

• M: random forest model to use for prediction; must be the result of a previous mlpack_random_forest[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_random_forest_predict[M, F, N] =
ext_ml_predict[:mlpack_random_forest_predict, M, F, N]

## mlpack_softmax_regression#

mlpack_softmax_regression[F, R, H]

An implementation of softmax regression for classification, which is a multiclass generalization of logistic regression. Given labeled data, a softmax regression model can be trained and saved for future use.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• lambda (Float64): L2-regularization constant. Default 0.0001.
• max_iterations (Int): Maximum number of iterations before termination. Default 400.
• no_intercept (Bool): Do not add the intercept term to the model.
• number_of_classes (Int): Number of classes for classification; if unspecified (or 0), the number of classes found in the labels will be used. Default 0.
• verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

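Example (the data and hyperparameter values below are illustrative):

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def labels = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def hyperparams = {("lambda", "0.001"); ("max_iterations", "500")}
def model = mlpack_softmax_regression[features, labels, hyperparams]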
Definition

@inline def mlpack_softmax_regression[F, R, H] =
ext_ml_train[:mlpack_softmax_regression, F, R, H]

## mlpack_softmax_regression_predict#

mlpack_softmax_regression_predict[M, F, N]

Given a softmax regression model trained with mlpack_softmax_regression[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_softmax_regression[] for more details.

Inputs:

• M: softmax regression model to use for prediction; must be the result of a previous mlpack_softmax_regression[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def mlpack_softmax_regression_predict[M, F, N] =
ext_ml_predict[:mlpack_softmax_regression_predict, M, F, N]
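Example (assuming model is the result of a previous mlpack_softmax_regression[] call, with illustrative test data; N is 1 because each point in test_features has a single key):

def test_features = {(1, 1.5); (2, 4.5)}
def predictions = mlpack_softmax_regression_predict[model, test_features, 1]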

## xgboost_classifier#

xgboost_classifier[F, L, H]

A binding of the xgboost() function to train an XGBoost model (via XGBoost.jl). This fits a boosted tree model with the XGBoost algorithm to the features F and labels L, using hyperparameters specified in the relation H.

If you would like to train a regression model with XGBoost, see xgboost_regressor[].

Note that XGBoost supports a very large number of hyperparameters, all of which are optional. Only the most common parameters are documented here; see the XGBoost documentation for more details on these parameters, as well as for the many less common hyperparameters not listed.

Inputs:

• F: relation of features to learn on
• L: relation of labels; the last variable should be the label; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters (incomplete list):

• num_round (Int): Number of rounds of boosting to perform. (Default 50.)
• booster (String): Which booster to use. Can be "gbtree", "gblinear" or "dart"; "gbtree" and "dart" use tree based models while "gblinear" uses linear functions. (Default "gbtree".)
• verbosity (Int): Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). (Default 1.)
• objective (String): Specify the learning task and the corresponding learning objective. Valid options include "binary:logistic", "binary:hinge", "multi:softmax", and other classification objectives listed in the XGBoost documentation. (Default "multi:softmax".)
• base_score (Float64): The initial prediction score of all instances. (Default 0.5.)
• eval_metric (String): Evaluation metrics for validation data. Valid choices include "merror", "error", "logloss", "auc", "aucpr", "ndcg", "map", and other classification evaluation metrics specified in the XGBoost documentation. (Default set based on objective value.)
• seed (Int): Random number seed. (Default 0.)
• eta (Float64): Step size shrinkage used in update to prevent overfitting. (Default 0.3.)
• gamma (Float64): Minimum loss reduction required to make a further partition on a leaf node of the tree. (Default 0.0.)
• max_depth (Int): Maximum depth of a tree. (Default 6.)
• min_child_weight (Float64): Minimum sum of instance weight (hessian) needed in a child. (Default 1.0.)
• max_delta_step (Float64): Maximum delta step we allow each leaf output to be. If the value is set to 0, it means there is no constraint. (Default 0.0.)

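Example (illustrative binary classification data; the objective shown is one of the valid options listed above):

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def labels = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def hyperparams = {("num_round", "20"); ("max_depth", "3"); ("objective", "binary:logistic")}
def model = xgboost_classifier[features, labels, hyperparams]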
Definition

@inline def xgboost_classifier[F, L, H] = ext_ml_train[:xgboost_classifier, F, L, H]

## xgboost_classifier_predict#

xgboost_classifier_predict[M, F, N]

Given an XGBoost classification model trained with xgboost_classifier[], make class predictions on a test set.

For more information, see the documentation for xgboost_classifier[] and the XGBoost documentation.

Inputs:

• M: XGBoost classification model to use for prediction; must be the result of a previous xgboost_classifier[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def xgboost_classifier_predict[M, F, N] =
ext_ml_predict[:xgboost_classifier_predict, M, F, N]
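Example (assuming model is the result of a previous xgboost_classifier[] call, with illustrative test data; N is 1 because each point in test_features has a single key):

def test_features = {(1, 1.5); (2, 4.5)}
def predictions = xgboost_classifier_predict[model, test_features, 1]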

## xgboost_classifier_probabilities#

xgboost_classifier_probabilities[M, F, N]

Given an XGBoost classification model trained with xgboost_classifier[], compute the probabilities of each class for each point in F.

Note that M must be an XGBoost classification model trained with the binary:logistic or multi:softprob objectives.

For more information, see the documentation for xgboost_classifier[] and the XGBoost documentation.

Inputs:

• M: XGBoost classification model to use for prediction; must be the result of a previous xgboost_classifier[] call
• F: relation of test features for which class predictions will be computed
• N: constant Int representing the number of keys in F

Result:

• A relation probabilities(keys..., class, prob) where keys... are the keys of each point in F, class takes values for every class in M, and prob is the probability of that class for those keys.

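Example (assuming model was trained with xgboost_classifier[] using a binary:logistic or multi:softprob objective, with illustrative test data; the result holds one probability per key and class):

def test_features = {(1, 1.5); (2, 4.5)}
def probabilities = xgboost_classifier_probabilities[model, test_features, 1]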
Definition

@inline def xgboost_classifier_probabilities[M, F, N] =
ext_ml_transform[:xgboost_classifier_probabilities, 0, M, F, N, {}]

## xgboost_feature_importances#

xgboost_feature_importances[M, F, H]

Given an XGBoost model trained with xgboost_classifier[] or xgboost_regressor[] and the feature module F that it was trained with (or an equivalent feature module with the same feature names), return an arity-2 relation mapping feature names (as Strings) to feature importance values.

Note that this relation may be empty if feature importance cannot be computed! (This could happen, for instance, if the model’s trees don’t have any splits at all.)

For more information, see the documentation for xgboost_classifier[], xgboost_regressor[], and the importances() function from XGBoost.jl.

Inputs:

• M: XGBoost classification or regression model; must be the result of a previous xgboost_classifier[] or xgboost_regressor[] call.
• F: relation containing all of the same features that the model was trained on
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

• type: type of feature importance to return; valid options are "gain", "cover", and "freq". Default "gain".

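Example (assuming model is a previously trained XGBoost model and features is the feature module it was trained with; the hyperparameter relation here selects cover importances instead of the default):

def importance_params = {("type", "cover")}
def importances = xgboost_feature_importances[model, features, importance_params]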
Definition

@inline def xgboost_feature_importances[M, F, H](f_str, imp) = exists(i :
    // We need to get a sorted list of specializations in F, but to do this we must
    // convert them to strings (we cannot sort symbols at the moment).
    sort[f : exists(f_sym, xs... : F(f_sym, xs...) and f = string[f_sym])](i, f_str) and
    // :xgboost_feature_importances produces mappings (i => importance) for integer i,
    // and these will match the sorted index of the feature names.  It's possible we might
    // not get a feature back!  That means we need to insert a default value.
    (
        ext_ml_transform[:xgboost_feature_importances, 0, M, {()}, 0, H](i, imp) or
        (not exists(v :
            ext_ml_transform[:xgboost_feature_importances, 0, M, {()}, 0, H](i, v)
        ) and imp = 0.0)
    )
)

## xgboost_regressor#

xgboost_regressor[F, R, H]

A binding of the xgboost() function to train an XGBoost regression model (via XGBoost.jl). This fits a boosted tree model with the XGBoost algorithm to the features F and responses R, using hyperparameters specified in the relation H.

If you would like to train a classification model with XGBoost, see xgboost_classifier[].

Note that XGBoost supports a very large number of hyperparameters, all of which are optional. Only the most common parameters are documented here; see the XGBoost documentation for more details on these parameters, as well as for the many less common hyperparameters not listed.

Inputs:

• F: relation of features to learn on
• R: relation of responses; the last variable should be the response; everything else should be keys
• H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters (incomplete list):

• num_round (Int): Number of rounds of boosting to perform. (Default 50.)
• booster (String): Which booster to use. Can be "gbtree", "gblinear" or "dart"; "gbtree" and "dart" use tree based models while "gblinear" uses linear functions. (Default "gbtree".)
• verbosity (Int): Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). (Default 1.)
• objective (String): Specify the learning task and the corresponding learning objective. Valid options include "reg:squarederror", "reg:squaredlogerror", "reg:logistic", "reg:pseudohubererror", "reg:gamma", "reg:tweedie", and other regression objectives listed in the XGBoost documentation. (Default "reg:squarederror".)
• base_score (Float64): The initial prediction score of all instances. (Default 0.5.)
• eval_metric (String): Evaluation metrics for validation data. Valid choices include "rmse", "rmsle", "mae", "mape", "mphe", and other regression evaluation metrics specified in the XGBoost documentation. (Default set based on objective value.)
• seed (Int): Random number seed. (Default 0.)
• eta (Float64): Step size shrinkage used in update to prevent overfitting. (Default 0.3.)
• gamma (Float64): Minimum loss reduction required to make a further partition on a leaf node of the tree. (Default 0.0.)
• max_depth (Int): Maximum depth of a tree. (Default 6.)
• min_child_weight (Float64): Minimum sum of instance weight (hessian) needed in a child. (Default 1.0.)
• max_delta_step (Float64): Maximum delta step we allow each leaf output to be. If the value is set to 0, it means there is no constraint. (Default 0.0.)

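Example (illustrative regression data and hyperparameter values):

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 2.0); (2, 4.1); (3, 5.9); (4, 8.0); (5, 10.2)}
def hyperparams = {("num_round", "20"); ("eta", "0.1")}
def model = xgboost_regressor[features, responses, hyperparams]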
Definition

@inline def xgboost_regressor[F, R, H] = ext_ml_train[:xgboost_regressor, F, R, H]

## xgboost_regressor_predict#

xgboost_regressor_predict[M, F, N]

Given an XGBoost regression model trained with xgboost_regressor[], make regression predictions on a test set.

For more information, see the documentation for xgboost_regressor[] and the XGBoost documentation.

Inputs:

• M: XGBoost regression model to use for prediction; must be the result of a previous xgboost_regressor[] call
• F: relation of test features for which regression predictions will be computed
• N: constant Int representing the number of keys in F

Definition

@inline def xgboost_regressor_predict[M, F, N] =
ext_ml_predict[:xgboost_regressor_predict, M, F, N]
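Example (assuming model is the result of a previous xgboost_regressor[] call, with illustrative test data; N is 1 because each point in test_features has a single key):

def test_features = {(1, 1.5); (2, 4.5)}
def predictions = xgboost_regressor_predict[model, test_features, 1]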