The Machine Learning Library (ml)

Collection of machine learning tools.

flatten

flatten[R]

This is a utility to “flatten” a module (e.g., a specialized relation). It should only be used internally, inside the external machine learning bindings. Do not use it in your programs!

Definition

@inline def flatten = rel_primitive_flatten

@inline def ext_ml_train[X, F, R, H] = rel_primitive_ext_ml_train[X, flatten[F], R, H]
@inline def ext_ml_predict[X, M, F, N] = rel_primitive_ext_ml_predict[X, M, flatten[F], N]
@inline def ext_ml_build[X, F, N, H] = rel_primitive_ext_ml_build[X, flatten[F], N, H]
@inline def ext_ml_transform[X, K, M, F, N, H] =
rel_primitive_ext_ml_transform[X, K, M, flatten[F], N, H]

mlpack_adaboost

mlpack_adaboost[F, R, H]

An implementation of the AdaBoost.MH (Adaptive Boosting) algorithm for classification. This can be used to train an AdaBoost model on labeled data.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • iterations (Int): The maximum number of boosting iterations to be run (0 will run until convergence). Default 1000.
  • tolerance (Float64): The tolerance for change in values of the weighted error during training. Default 1e-10.
  • verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.
  • weak_learner (String): The type of weak learner to use: decision_stump, or perceptron. Default decision_stump.
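
The hyperparameter relation H is a set of (String, String) pairs, with every value written as a string regardless of its documented type. A hypothetical example for the parameters above (values illustrative):

```rel
// Illustrative values only; the binding parses each string into the
// documented type (Int, Float64, Bool, String) internally.
def adaboost_params = {
    ("iterations", "500");
    ("tolerance", "1e-8");
    ("weak_learner", "perceptron")
}
```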

Definition

@inline def mlpack_adaboost[F, R, H] = ext_ml_train[:mlpack_adaboost, F, R, H]

mlpack_adaboost_predict

mlpack_adaboost_predict[M, F, N]

Given an AdaBoost.MH model trained with mlpack_adaboost[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_adaboost[] for more details.

Inputs:

  • M: AdaBoost model to use for prediction; must be the result of a previous mlpack_adaboost[] call
  • F: relation of test features for which class predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_adaboost_predict[M, F, N] =
ext_ml_predict[:mlpack_adaboost_predict, M, F, N]
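
A minimal end-to-end sketch of training and prediction, assuming a single Int key per data point (the data and relation names are illustrative, not part of the library):

```rel
// Features: (key, feature1, feature2); responses: (key, class label).
def features = {(1, 1.0, 2.0); (2, 0.5, 1.5); (3, 3.0, 0.2)}
def labels = {(1, "pos"); (2, "pos"); (3, "neg")}

// Train, then predict. The final argument N = 1 because each test
// point is identified by a single key.
def model = mlpack_adaboost[features, labels, {("iterations", "100")}]
def test_points = {(10, 0.9, 1.8); (11, 2.7, 0.4)}
def predictions = mlpack_adaboost_predict[model, test_points, 1]
```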

mlpack_decision_tree

mlpack_decision_tree[F, R, H]

An implementation of an ID3-style decision tree for classification, which supports categorical data. This binding accepts categorical features in F; a feature in F is interpreted as categorical if it is an entity or has String type.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • maximum_depth (Int): Maximum depth of the tree (0 means no limit). Default 0.
  • minimum_gain_split (Float64): Minimum gain for node splitting. Default 1e-7.
  • minimum_leaf_size (Int): Minimum number of points in a leaf. Default 20.
  • print_training_accuracy (Bool): Print the training accuracy. Default false.
  • verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_decision_tree[F, R, H] = ext_ml_train[:mlpack_decision_tree, F, R, H]
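
To illustrate the categorical-feature rule above: a String-valued argument of F is treated as categorical, while numeric arguments are not (data illustrative):

```rel
// The second feature ("red"/"blue") has String type, so the binding
// interprets it as categorical; the first feature is numeric.
def features = {(1, 1.5, "red"); (2, 0.3, "blue"); (3, 2.2, "red")}
def labels = {(1, "a"); (2, "b"); (3, "a")}
def tree = mlpack_decision_tree[features, labels, {("minimum_leaf_size", "1")}]
```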

mlpack_decision_tree_predict

mlpack_decision_tree_predict[M, F, N]

Given a decision tree model trained with mlpack_decision_tree[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_decision_tree[] for more details.

Inputs:

  • M: decision tree model to use for prediction; must be the result of a previous mlpack_decision_tree[] call
  • F: relation of test features for which class predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_decision_tree_predict[M, F, N] =
ext_ml_predict[:mlpack_decision_tree_predict, M, F, N]

mlpack_hoeffding_tree

mlpack_hoeffding_tree[F, R, H]

An implementation of Hoeffding trees, a form of streaming decision tree for classification. Given labeled data, a Hoeffding tree can be trained. This binding accepts categorical features in F; a feature in F is interpreted as categorical if it is an entity or has String type.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • batch_mode (Bool): If true, samples will be considered in batch instead of as a stream. This generally results in better trees but at the cost of memory usage and runtime.
  • bins (Int): If the domingos split strategy is used, this specifies the number of bins for each numeric split. Default 10.
  • confidence (Float64): Confidence before splitting (between 0 and 1). Default 0.95.
  • info_gain (Bool): If set, information gain is used instead of Gini impurity for calculating Hoeffding bounds.
  • max_samples (Int): Maximum number of samples before splitting. Default 5000.
  • min_samples (Int): Minimum number of samples before splitting. Default 100.
  • numeric_split_strategy (String): The splitting strategy to use for numeric features: domingos or binary. Default binary.
  • observations_before_binning (Int): If the domingos split strategy is used, this specifies the number of samples observed before binning is performed.
  • passes (Int): Number of passes to take over the dataset. Default 1.
  • verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_hoeffding_tree[F, R, H] = ext_ml_train[:mlpack_hoeffding_tree, F, R, H]

mlpack_hoeffding_tree_predict

mlpack_hoeffding_tree_predict[M, F, N]

Given a Hoeffding tree model trained with mlpack_hoeffding_tree[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_hoeffding_tree[] for more details.

Inputs:

  • M: Hoeffding tree model to use for prediction; must be the result of a previous mlpack_hoeffding_tree[] call
  • F: relation of test features for which class predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_hoeffding_tree_predict[M, F, N] =
ext_ml_predict[:mlpack_hoeffding_tree_predict, M, F, N]

mlpack_lars

mlpack_lars[F, R, H]

An implementation of Least Angle Regression (Stagewise/LASSO), also known as LARS. This can train a LARS, LASSO, or Elastic Net model.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • lambda1 (Float64): Regularization parameter for l1-norm penalty. Default 0.
  • lambda2 (Float64): Regularization parameter for l2-norm penalty. Default 0.
  • use_cholesky (Bool): Use Cholesky decomposition during computation rather than explicitly computing the full Gram matrix.
  • verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_lars[F, R, H] = ext_ml_train[:mlpack_lars, F, R, H]

mlpack_lars_predict

mlpack_lars_predict[M, F, N]

Given a LARS model trained with mlpack_lars[], make predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_lars[] for more details.

Inputs:

  • M: LARS model to use for prediction; must be the result of a previous mlpack_lars[] call
  • F: relation of test features for which predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_lars_predict[M, F, N] = ext_ml_predict[:mlpack_lars_predict, M, F, N]

mlpack_linear_regression

mlpack_linear_regression[F, R, H]

An implementation of simple linear regression and ridge regression using ordinary least squares. Given a dataset and responses, a model can be trained.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • lambda (Float64): Tikhonov regularization for ridge regression. If 0, the method reduces to linear regression. Default 0.
  • verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_linear_regression[F, R, H] =
ext_ml_train[:mlpack_linear_regression, F, R, H]

mlpack_linear_regression_predict

mlpack_linear_regression_predict[M, F, N]

Given a linear regression model trained with mlpack_linear_regression[], make predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_linear_regression[] for more details.

Inputs:

  • M: linear regression model to use for prediction; must be the result of a previous mlpack_linear_regression[] call
  • F: relation of test features for which predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_linear_regression_predict[M, F, N] =
ext_ml_predict[:mlpack_linear_regression_predict, M, F, N]
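
A regression-flavored sketch (responses are numeric rather than class labels; data illustrative):

```rel
def features = {(1, 1.0); (2, 2.0); (3, 3.0)}
def responses = {(1, 2.1); (2, 3.9); (3, 6.2)}

// lambda = 0 gives ordinary least squares; a positive value gives
// ridge regression.
def model = mlpack_linear_regression[features, responses, {("lambda", "0")}]
def predictions = mlpack_linear_regression_predict[model, {(10, 4.0)}, 1]
```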

mlpack_linear_svm

mlpack_linear_svm[F, R, H]

An implementation of linear SVM for multiclass classification. Given labeled data, a model can be trained and saved for future use.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • delta (Float64): Margin of difference between the correct class and other classes. Default 1.0.
  • epochs (Int): Maximum number of full epochs over the dataset for psgd. Default 50.
  • lambda (Float64): L2-regularization parameter for training. Default 0.0001.
  • max_iterations (Int): Maximum iterations for optimizer (0 indicates no limit). Default 10000.
  • no_intercept (Bool): Do not add the intercept term to the model. Default false.
  • num_classes (Int): Number of classes for classification; if unspecified (or 0), the number of classes found in the labels will be used. Default 0.
  • optimizer (String): Optimizer to use for training ("lbfgs" or "psgd"). Default "lbfgs".
  • seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
  • shuffle (Bool): If true, don’t shuffle the order in which data points are visited for parallel SGD. Default false.
  • step_size (Float64): Step size for parallel SGD optimizer. Default 0.01.
  • tolerance (Float64): Convergence tolerance for optimizer. Default 1e-10.
  • verbose (Bool): Display informational messages. Default false.

Definition

@inline def mlpack_linear_svm[F, R, H] = ext_ml_train[:mlpack_linear_svm, F, R, H]

mlpack_linear_svm_predict

mlpack_linear_svm_predict[M, F, N]

Given a linear SVM model trained with mlpack_linear_svm[], make predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_linear_svm[] for more details.

Inputs:

  • M: linear SVM model to use for prediction; must be the result of a previous mlpack_linear_svm[] call
  • F: relation of test features for which predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_linear_svm_predict[M, F, N] =
ext_ml_predict[:mlpack_linear_svm_predict, M, F, N]

mlpack_logistic_regression

mlpack_logistic_regression[F, R, H]

An implementation of L2-regularized logistic regression for two-class classification. Given labeled data, a model can be trained and saved for future use.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • batch_size (Int): Batch size for SGD. Default 64.
  • decision_boundary (Float64): Decision boundary for prediction; if the logistic function for a point is less than the boundary, the class is taken to be 1; otherwise, the class is 2. Default 0.5.
  • lambda (Float64): L2-regularization parameter for training. Default 0.
  • max_iterations (Int): Maximum iterations for optimizer (0 indicates no limit). Default 10000.
  • optimizer (String): Optimizer to use for training ("lbfgs" or "sgd"). Default "lbfgs".
  • step_size (Float64): Step size for SGD optimizer. Default 0.01.
  • tolerance (Float64): Convergence tolerance for optimizer. Default 1e-10.
  • verbose (Bool): Display informational messages. Default false.

Definition

@inline def mlpack_logistic_regression[F, R, H] =
ext_ml_train[:mlpack_logistic_regression, F, R, H]

mlpack_logistic_regression_predict

mlpack_logistic_regression_predict[M, F, N]

Given a logistic regression model trained with mlpack_logistic_regression[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_logistic_regression[] for more details.

Inputs:

  • M: logistic regression model to use for class predictions; must be the result of a previous mlpack_logistic_regression[] call
  • F: relation of test features for which class predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_logistic_regression_predict[M, F, N] =
ext_ml_predict[:mlpack_logistic_regression_predict, M, F, N]

mlpack_nbc

mlpack_nbc[F, R, H]

An implementation of the Naive Bayes Classifier, used for classification. Given labeled data, an NBC model can be trained.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • incremental_variance (Bool): The variance of each class will be calculated incrementally.
  • verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_nbc[F, R, H] = ext_ml_train[:mlpack_nbc, F, R, H]

mlpack_nbc_predict

mlpack_nbc_predict[M, F, N]

Given a Naive Bayes classifier model trained with mlpack_nbc[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_nbc[] for more details.

Inputs:

  • M: Naive Bayes classification model to use for prediction; must be the result of a previous mlpack_nbc[] call
  • F: relation of test features for which class predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_nbc_predict[M, F, N] = ext_ml_predict[:mlpack_nbc_predict, M, F, N]

mlpack_perceptron

mlpack_perceptron[F, R, H]

An implementation of a perceptron—a single level neural network—for classification. Given labeled data, a perceptron can be trained.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • max_iterations (Int): The maximum number of iterations the perceptron is to be run. Default 1000.
  • verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_perceptron[F, R, H] = ext_ml_train[:mlpack_perceptron, F, R, H]

mlpack_perceptron_predict

mlpack_perceptron_predict[M, F, N]

Given a perceptron model trained with mlpack_perceptron[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_perceptron[] for more details.

Inputs:

  • M: Perceptron model to use for prediction; must be the result of a previous mlpack_perceptron[] call
  • F: relation of test features for which class predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_perceptron_predict[M, F, N] =
ext_ml_predict[:mlpack_perceptron_predict, M, F, N]

mlpack_random_forest

mlpack_random_forest[F, R, H]

An implementation of the standard random forest algorithm by Leo Breiman for classification. Given labeled data, a random forest can be trained.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • maximum_depth (Int): Maximum depth of the tree (0 means no limit). Default 0.
  • minimum_gain_split (Float64): Minimum gain needed to make a split when building a tree. Default 0.0.
  • minimum_leaf_size (Int): Minimum number of points in each leaf node. Default 1.
  • num_trees (Int): Number of trees in the random forest. Default 10.
  • print_training_accuracy (Bool): If set, then the accuracy of the model on the training set will be predicted (verbose must also be specified).
  • seed (Int): Random seed. If 0, ‘std::time(NULL)’ is used. Default 0.
  • subspace_dim (Int): Dimensionality of random subspace to use for each split. 0 will autoselect the square root of data dimensionality. Default 0.
  • verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_random_forest[F, R, H] =
ext_ml_train[:mlpack_random_forest, F, R, H]

mlpack_random_forest_predict

mlpack_random_forest_predict[M, F, N]

Given a random forest model trained with mlpack_random_forest[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_random_forest[] for more details.

Inputs:

  • M: random forest model to use for prediction; must be the result of a previous mlpack_random_forest[] call
  • F: relation of test features for which class predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_random_forest_predict[M, F, N] =
ext_ml_predict[:mlpack_random_forest_predict, M, F, N]

mlpack_softmax_regression

mlpack_softmax_regression[F, R, H]

An implementation of softmax regression for classification, which is a multiclass generalization of logistic regression. Given labeled data, a softmax regression model can be trained and saved for future use.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • lambda (Float64): L2-regularization constant. Default 0.0001.
  • max_iterations (Int): Maximum number of iterations before termination. Default 400.
  • no_intercept (Bool): Do not add the intercept term to the model.
  • number_of_classes (Int): Number of classes for classification; if unspecified (or 0), the number of classes found in the labels will be used. Default 0.
  • verbose (Bool): Display informational messages and the full list of parameters and timers at the end of execution.

Definition

@inline def mlpack_softmax_regression[F, R, H] =
ext_ml_train[:mlpack_softmax_regression, F, R, H]

mlpack_softmax_regression_predict

mlpack_softmax_regression_predict[M, F, N]

Given a softmax regression model trained with mlpack_softmax_regression[], make class predictions on a test set.

See also the mlpack documentation and the documentation for mlpack_softmax_regression[] for more details.

Inputs:

  • M: softmax regression model to use for prediction; must be the result of a previous mlpack_softmax_regression[] call
  • F: relation of test features for which class predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def mlpack_softmax_regression_predict[M, F, N] =
ext_ml_predict[:mlpack_softmax_regression_predict, M, F, N]

mlpack_knn_build

mlpack_knn_build[R, N, H]

An implementation of k-nearest-neighbor search using single-tree and dual-tree algorithms. Given a set of reference points and query points, this can build trees that can be used in later calls to mlpack_knn[].

See also the mlpack documentation and the documentation for mlpack_knn[] for more details.

Inputs:

  • R: relation of reference points that tree should be built on
  • N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • leaf_size (Int): Leaf size for tree building (used for kd-trees, vp trees, random projection trees, UB trees, R trees, R* trees, X trees, Hilbert R trees, R+ trees, R++ trees, spill trees, and octrees). Default 20.
  • random_basis (Bool): Before tree-building, project the data onto a random orthogonal basis. Default false.
  • rho (Float64): Balance threshold (only valid for spill trees). Default 0.7.
  • tau (Float64): Overlapping size (only valid for spill trees). Default 0.
  • tree_type (String): Type of tree to use: “kd”, “vp”, “rp”, “max-rp”, “ub”, “cover”, “r”, “r-star”, “x”, “ball”, “hilbert-r”, “r-plus”, “r-plus-plus”, “spill”, “oct”. Default "kd".
  • verbose (Bool): Display informational messages.

Returns:

  • A KNN model that can be used in a later call to mlpack_knn[].

Definition

@inline def mlpack_knn_build[R, N, H] = ext_ml_build[:mlpack_knn, R, N, H]
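
For instance, building a model over 2-D reference points with one key argument, so N = 1 (data illustrative):

```rel
// (key, x, y): the single key argument is excluded from the model,
// so the tree is built over the (x, y) coordinates only.
def ref_points = {(1, 0.0, 0.0); (2, 1.0, 1.0); (3, 5.0, 5.0)}
def knn_model = mlpack_knn_build[ref_points, 1, {("tree_type", "kd")}]
```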

mlpack_knn

mlpack_knn[K, M, Q, N, H]

Perform k-nearest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_knn_build[].

See also the mlpack documentation and the documentation for mlpack_knn_build[] for more details.

Inputs:

  • K: constant representing number of nearest neighbors to search for.
  • M: pretrained model for kNN; must be the result of a previous mlpack_knn_build[] call.
  • Q: relation of query points; must have the same number of keys as the relation that M was built with.
  • N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • algorithm (String): Type of neighbor search: “naive”, “single_tree”, “dual_tree”, “greedy”. Default "dual_tree".
  • epsilon (Float64): If specified, will do approximate nearest neighbor search with given relative error. Default 0.
  • verbose (Bool): Display informational messages.

Returns:

  • A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th nearest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_knn[K, M, Q, N, H] = ext_ml_transform[:mlpack_knn, K, M, Q, N, H]
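
Putting build and search together (data illustrative): with one query key and one reference key, each output tuple has the form (query_key, k, reference_key, distance):

```rel
def ref_points = {(1, 0.0, 0.0); (2, 1.0, 1.0); (3, 5.0, 5.0)}
def knn_model = mlpack_knn_build[ref_points, 1, {("tree_type", "kd")}]

// One query point with key 100; K = 2 neighbors; N = 1 key argument.
def queries = {(100, 0.9, 0.9)}
def neighbors = mlpack_knn[2, knn_model, queries, 1, {("algorithm", "dual_tree")}]
// Expect reference point 2 as the nearest neighbor (k = 1) and
// point 1 as the second nearest (k = 2).
```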

mlpack_lsh_build

mlpack_lsh_build[R, N, H]

An implementation of approximate k-nearest-neighbor search with locality-sensitive hashing (LSH). Given a set of reference points, this will build an LSH model.

See also the mlpack documentation and the documentation for mlpack_lsh[] for more details.

Inputs:

  • R: relation of reference points that tree should be built on
  • N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • bucket_size (Int): The size of a bucket in the second level hash. Default 500.
  • hash_width (Float64): The hash width for the first-level hashing in the LSH preprocessing. By default, the LSH class automatically estimates a hash width for its use.
  • projections (Int): The number of hash functions for each table. Default 10.
  • second_hash_size (Int): The size of the second level hash table. Default 99901.
  • seed (Int): Random seed. If 0, ‘std::time(NULL)’ is used. Default 0.
  • tables (Int): The number of hash tables to be used. Default 30.
  • verbose (Bool): Display informational messages.

Returns:

  • An LSH model that can be used in a later call to mlpack_lsh[].

Definition

@inline def mlpack_lsh_build[R, N, H] = ext_ml_build[:mlpack_lsh, R, N, H]

mlpack_lsh

mlpack_lsh[K, M, Q, N, H]

Perform approximate k-nearest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_lsh_build[].

See also the mlpack documentation and the documentation for mlpack_lsh_build[] for more details.

Inputs:

  • K: constant representing number of nearest neighbors to search for.
  • M: pretrained LSH model; must be the result of a previous mlpack_lsh_build[] call.
  • Q: relation of query points; must have the same number of keys as the relation that M was built with.
  • N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • num_probes (Int): Number of additional probes for multiprobe LSH; if 0, traditional LSH is used. Default 0.
  • verbose (Bool): Display informational messages.

Returns:

  • A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th nearest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_lsh[K, M, Q, N, H] = ext_ml_transform[:mlpack_lsh, K, M, Q, N, H]

mlpack_kfn_build

mlpack_kfn_build[R, N, H]

An implementation of k-furthest-neighbor search using single-tree and dual-tree algorithms. This can build a tree that can be saved for future use.

See also the mlpack documentation and the documentation for mlpack_kfn[] for more details.

Inputs:

  • R: relation of reference points that tree should be built on
  • N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • leaf_size (Int): Leaf size for tree building (used for kd-trees, vp trees, random projection trees, UB trees, R trees, R* trees, X trees, Hilbert R trees, R+ trees, R++ trees, and octrees). Default 20.
  • random_basis (Bool): Before tree-building, project the data onto a random orthogonal basis. Default false.
  • seed (Int): Random seed (if 0, std::time(NULL) is used). Default 0.
  • tree_type (String): Type of tree to use: "kd", "vp", "rp", "max-rp", "ub", "cover", "r", "r-star", "x", "ball", "hilbert-r", "r-plus", "r-plus-plus", "oct". Default "kd".
  • verbose (Bool): Display informational messages.

Returns:

  • A KFN model that can be used with a later call to mlpack_kfn[].

Definition

@inline def mlpack_kfn_build[R, N, H] = ext_ml_build[:mlpack_kfn, R, N, H]

mlpack_kfn

mlpack_kfn[K, M, Q, N, H]

Perform k-furthest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_kfn_build[].

See also the mlpack documentation and the documentation for mlpack_kfn_build[] for more details.

Inputs:

  • K: constant representing number of furthest neighbors to search for.
  • M: pretrained KFN model; must be the result of a previous mlpack_kfn_build[] call.
  • Q: relation of query points; must have the same number of keys as the relation that M was built with.
  • N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • algorithm (String): Type of neighbor search: “naive”, “single_tree”, “dual_tree”, “greedy”. Default "dual_tree".
  • epsilon (Float64): If specified, will do approximate furthest neighbor search with given relative error. Default 0.
  • percentage (Float64): If specified, will do approximate furthest neighbor search. Must be in the range (0,1] (decimal form). Resultant neighbors will be at least (p*100)% of the distance as the true furthest neighbor. Default 1.
  • verbose (Bool): Display informational messages.

Returns:

  • A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th furthest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_kfn[K, M, Q, N, H] = ext_ml_transform[:mlpack_kfn, K, M, Q, N, H]

mlpack_approx_kfn_build

mlpack_approx_kfn_build[R, N, H]

An implementation of two strategies for furthest neighbor search. This creates a furthest neighbor search model that can be reused later.

See also the mlpack documentation and the documentation for mlpack_approx_kfn[] for more details.

Inputs:

  • R: relation of reference points that tree should be built on
  • N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • algorithm (String): Algorithm to use: "ds" or "qdafn". Default "ds".
  • num_projections (Int): Number of projections to use in each hash table. Default 5.
  • num_tables (Int): Number of hash tables to use. Default 5.
  • verbose (Bool): Display informational messages.

Returns:

  • An approximate KFN model that can be used in a later call to mlpack_approx_kfn[].

Definition

@inline def mlpack_approx_kfn_build[R, N, H] = ext_ml_build[:mlpack_approx_kfn, R, N, H]

mlpack_approx_kfn

mlpack_approx_kfn[K, M, Q, N, H]

Perform approximate k-furthest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_approx_kfn_build[].

See also the mlpack documentation for more details.

Inputs:

  • K: constant representing the number of furthest neighbors to search for.
  • M: pretrained approximate KFN model; must be the result of a previous mlpack_approx_kfn_build[] call.
  • Q: relation of query points; must have the same number of keys as the relation that M was built with.
  • N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • calculate_error (Bool): If set, calculate and display the average distance error for the first furthest neighbor only.
  • verbose (Bool): Display informational messages.

Returns:

  • A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th approximate furthest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_approx_kfn[K, M, Q, N, H] =
ext_ml_transform[:mlpack_approx_kfn, K, M, Q, N, H]
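
Example (a sketch; the coordinates and hyperparameter values are purely illustrative): build a model on four 2-D reference points keyed by an integer id, then find the single furthest neighbor of each of two query points:

def refs = {(1, 0.0, 0.0); (2, 0.1, 0.2); (3, 5.0, 5.0); (4, 5.1, 4.9)}
def model = mlpack_approx_kfn_build[refs, 1, {("algorithm", "ds")}]
def queries = {(1, 0.0, 0.1); (2, 5.0, 5.1)}
def result = mlpack_approx_kfn[1, model, queries, 1, {("calculate_error", "false")}]

Each tuple of result then has the form (query_id, k, reference_id, distance), with k = 1 since K is 1.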

mlpack_dbscan

mlpack_dbscan[F, N, H]

An implementation of DBSCAN clustering. Given a dataset, this can compute and return a clustering of that dataset.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of data points to cluster.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • epsilon (Float64): Radius of each range search. Default 1.
  • min_size (Int): Minimum number of points for a cluster. Default 5.
  • naive (Bool): If set, brute-force range search (not tree-based) will be used. Default false.
  • selection_type (String): If using point selection policy, the type of selection to use ("ordered", "random"). Default "ordered".
  • single_mode (Bool): If set, single-tree range search (not dual-tree) will be used. Default false.
  • tree_type (String): If using single-tree or dual-tree search, the type of tree to use ("kd", "r", "r-star", "x", "hilbert-r", "r-plus", "r-plus-plus", "cover", "ball"). Default "kd".
  • verbose (Bool): Display informational messages.

Returns:

  • A relation containing the keys in F, with a cluster assignment (Int) as the last argument. If the point is considered “noise” (i.e. not part of any cluster), the cluster assignment is 0.

Definition

@inline def mlpack_dbscan[F, N, H] = ext_ml_transform[:mlpack_dbscan, 0, {()}, F, N, H]
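
Example (a sketch; the data and hyperparameter values are purely illustrative): cluster five 2-D points keyed by an integer id:

def points = {(1, 0.0, 0.1); (2, 0.2, 0.0); (3, 0.1, 0.2); (4, 8.0, 8.1); (5, 8.2, 8.0)}
def clusters = mlpack_dbscan[points, 1, {("epsilon", "1.0"); ("min_size", "2")}]

Each tuple of clusters has the form (id, cluster), with cluster 0 reserved for noise points.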

mlpack_kmeans

mlpack_kmeans[K, F, N, H]

An implementation of several strategies for efficient k-means clustering. Given a dataset and a value of k, this computes and returns a k-means clustering on that data.

See also the mlpack documentation for more details.

Inputs:

  • K: constant indicating the number of clusters for k-means clustering.
  • F: relation of data points to cluster.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • algorithm (String): Algorithm to use for the Lloyd iteration ("naive", "pelleg-moore", "elkan", "hamerly", "dualtree", or "dualtree-covertree"). Default "naive".
  • allow_empty_clusters (Bool): Allow empty clusters to persist. Default false.
  • kill_empty_clusters (Bool): Remove empty clusters when they occur. Default false.
  • max_iterations (Int): Maximum number of iterations before k-means terminates. Default 1000.
  • percentage (Float64): Percentage of dataset to use for each refined start sampling (use when refined_start is specified). Default 0.02.
  • refined_start (Bool): Use the refined initial point strategy by Bradley and Fayyad to choose initial points. Default false.
  • samplings (Int): Number of samplings to perform for refined start (use when refined_start is specified). Default 100.
  • seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation containing the keys in F with a cluster assignment (Int) between 1 and K as the last argument.

Definition

@inline def mlpack_kmeans[K, F, N, H] = ext_ml_transform[:mlpack_kmeans, K, {()}, F, N, H]
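
Example (a sketch; the data and hyperparameter values are purely illustrative): partition five 2-D points, keyed by an integer id, into two clusters:

def points = {(1, 0.0, 0.1); (2, 0.2, 0.0); (3, 0.1, 0.2); (4, 8.0, 8.1); (5, 8.2, 8.0)}
def assignments = mlpack_kmeans[2, points, 1, {("max_iterations", "100"); ("seed", "1")}]

Each tuple of assignments has the form (id, cluster), with cluster between 1 and 2.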

mlpack_kmeans_centroids

mlpack_kmeans_centroids[K, F, N, H]

An implementation of several strategies for efficient k-means clustering. Given a dataset and a value of k, this computes and returns centroids for a k-means clustering on that data.

See also the mlpack documentation for more details.

Inputs:

  • K: constant indicating the number of clusters for k-means clustering.
  • F: relation of data points to cluster.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • algorithm (String): Algorithm to use for the Lloyd iteration ("naive", "pelleg-moore", "elkan", "hamerly", "dualtree", or "dualtree-covertree"). Default "naive".
  • allow_empty_clusters (Bool): Allow empty clusters to persist. Default false.
  • kill_empty_clusters (Bool): Remove empty clusters when they occur. Default false.
  • max_iterations (Int): Maximum number of iterations before k-means terminates. Default 1000.
  • percentage (Float64): Percentage of dataset to use for each refined start sampling (use when refined_start is specified). Default 0.02.
  • refined_start (Bool): Use the refined initial point strategy by Bradley and Fayyad to choose initial points. Default false.
  • samplings (Int): Number of samplings to perform for refined start (use when refined_start is specified). Default 100.
  • seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation containing a cluster index between 1 and K that maps to the centroid of each dimension in F. So, the first argument of this relation is the cluster index, and the rest correspond to the arguments of F that are after the first N key arguments.

Definition

@inline def mlpack_kmeans_centroids[K, F, N, H] =
ext_ml_transform[:mlpack_kmeans_centroids, K, {()}, F, N, H]
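
Example (a sketch; the data and hyperparameter values are purely illustrative): compute the centroids of a two-cluster k-means clustering of four 2-D points keyed by an integer id:

def points = {(1, 0.0, 0.1); (2, 0.2, 0.0); (3, 8.0, 8.1); (4, 8.2, 8.0)}
def centroids = mlpack_kmeans_centroids[2, points, 1, {("seed", "1")}]

Each tuple of centroids has the form (cluster, x, y), with cluster between 1 and 2.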

mlpack_mean_shift

mlpack_mean_shift[F, N, H]

A fast implementation of mean-shift clustering using dual-tree range search. Given a dataset, this uses the mean shift algorithm to produce and return a clustering of the data.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of data points to cluster.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • force_convergence (Bool): If specified, the mean shift algorithm will continue running regardless of max_iterations until the clusters converge. Default false.
  • max_iterations (Int): Maximum number of iterations before mean shift terminates. Default 1000.
  • radius (Float64): If the distance between two centroids is less than the given radius, one will be removed. A radius of 0 or less means an estimate will be calculated and used for the radius. Default 0.
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation containing the keys in F with a cluster assignment (Int) as the last element. If the key was not assigned to a cluster, the cluster assignment will be 0.

Definition

@inline def mlpack_mean_shift[F, N, H] =
ext_ml_transform[:mlpack_mean_shift, 0, {()}, F, N, H]

mlpack_gmm_train

mlpack_gmm_train[F, N, H]

An implementation of the EM algorithm for training Gaussian mixture models (GMMs). Given a dataset, this can train a GMM for future use with other tools.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of data points that the GMM should be trained on.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when training the model).
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • diagonal_covariance (Bool): Force the covariance of the Gaussians to be diagonal. This can accelerate training time significantly. Default false.
  • gaussians (Int): Number of Gaussians in the GMM. Required.
  • kmeans_max_iterations (Int): Maximum number of iterations for the k-means algorithm (used to initialize EM). Default 1000.
  • max_iterations (Int): Maximum number of iterations of EM algorithm (passing 0 will run until convergence). Default 250.
  • no_force_positive (Bool): Do not force the covariance matrices to be positive definite. Default false.
  • noise (Float64): Variance of zero-mean Gaussian noise to add to data. Default 0.
  • percentage (Float64): If using refined_start, specify the percentage of the dataset used for each sampling (should be between 0.0 and 1.0). Default 0.02.
  • refined_start (Bool): During the initialization, use refined initial positions for k-means clustering (Bradley and Fayyad, 1998). Default false.
  • samplings (Int): If using refined_start, specify the number of samplings used for initial points. Default 100.
  • seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
  • tolerance (Float64): Tolerance for convergence of EM. Default 1e-10.
  • trials (Int): Number of trials to perform in training GMM. Default 1.
  • verbose (Bool): Display informational messages. Default false.

Definition

@inline def mlpack_gmm_train[F, N, H] = ext_ml_build[:mlpack_gmm_train, F, N, H]
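
Example (a sketch; the data and hyperparameter values are purely illustrative): train a two-component GMM on five 2-D points keyed by an integer id (gaussians is the only required hyperparameter):

def points = {(1, 0.0, 0.1); (2, 0.2, 0.0); (3, 0.1, 0.1); (4, 9.0, 9.1); (5, 9.2, 9.0)}
def model = mlpack_gmm_train[points, 1, {("gaussians", "2")}]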

mlpack_gmm_generate

mlpack_gmm_generate[S, M, D, H]

A sample generator for pre-trained GMMs. Given a pre-trained GMM, this can sample new points randomly from that distribution.

See also the mlpack documentation for more details.

Inputs:

  • S: constant indicating the number of samples to generate.
  • M: pre-trained GMM from mlpack_gmm_train[].
  • D: constant representing the dimensionality of the model (i.e. the dimensionality of F in the call to mlpack_gmm_train[]).
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation containing S samples from the given GMM M. The first argument is the key (an integer between 1 and S) and the rest of the arguments are each of the features.

Definition

@inline def mlpack_gmm_generate[S, M, D, H] =
ext_ml_transform[:mlpack_gmm_generate, S, M, {()}, D, H]
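
Example (a sketch; the data and hyperparameter values are purely illustrative): train a GMM on 2-D data and draw 100 samples from it; because the training data had two feature dimensions, D is 2:

def points = {(1, 0.0, 0.1); (2, 0.2, 0.0); (3, 9.0, 9.1); (4, 9.2, 9.0)}
def model = mlpack_gmm_train[points, 1, {("gaussians", "2")}]
def samples = mlpack_gmm_generate[100, model, 2, {("seed", "1")}]

Each tuple of samples has the form (i, x, y), with i between 1 and 100.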

mlpack_gmm_probability

mlpack_gmm_probability[M, F, N, H]

A probability calculator for GMMs. Given a pre-trained GMM and a set of points, this can compute the probability that each point is from the given GMM.

See also the mlpack documentation for more details.

Inputs:

  • M: pre-trained GMM from mlpack_gmm_train[].
  • F: relation of data points to compute the probabilities of.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation containing the keys of F (that is, the first N arguments) mapping to the probability that each of those samples arose from the GMM M.

Definition

@inline def mlpack_gmm_probability[M, F, N, H] =
ext_ml_transform[:mlpack_gmm_probability, 0, M, F, N, H]
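
Example (a sketch; the data and hyperparameter values are purely illustrative): score two query points against a trained GMM:

def points = {(1, 0.0, 0.1); (2, 0.2, 0.0); (3, 9.0, 9.1); (4, 9.2, 9.0)}
def model = mlpack_gmm_train[points, 1, {("gaussians", "2")}]
def queries = {(1, 0.1, 0.1); (2, 4.5, 4.5)}
def probs = mlpack_gmm_probability[model, queries, 1, {("verbose", "false")}]

Each tuple of probs has the form (id, probability).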

mlpack_emst

mlpack_emst[F, N, H]

An implementation of the Dual-Tree Boruvka algorithm for computing the Euclidean minimum spanning tree of a set of input points.

See also the mlpack documentation for more details.

Inputs:

  • F: relation of data points to compute the minimum spanning tree of.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when clustering the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • leaf_size (Int): Leaf size in the kd-tree. One-element leaves give the empirically best performance, but at the cost of greater memory requirements. Default 1.
  • naive (Bool): Compute the MST using O(n^2) naive algorithm. Default false.
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • An ordered edge relation with weights. Specifically, each point in F is associated with a set of N keys. The first argument of the output relation is the index of the edge (starting from 1); lower-weighted edges have lower indices. The next N arguments of the output relation correspond to the first vertex; the N arguments after that correspond to the second vertex; and the last argument is the distance between those two vertices.

Definition

@inline def mlpack_emst[F, N, H] = ext_ml_transform[:mlpack_emst, 0, {()}, F, N, H]
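
Example (a sketch; the data and hyperparameter values are purely illustrative): compute the Euclidean minimum spanning tree of three 2-D points keyed by an integer id:

def points = {(1, 0.0, 0.0); (2, 1.0, 0.0); (3, 5.0, 0.0)}
def mst = mlpack_emst[points, 1, {("leaf_size", "1")}]

Each tuple of mst has the form (i, u, v, distance); with three points the tree has two edges, and the shorter edge gets index 1.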

mlpack_fastmks_build

mlpack_fastmks_build[R, N, H]

An implementation of max-kernel search using single-tree and dual-tree algorithms. Given a set of reference points and query points, this can build trees that can be used in later calls to mlpack_fastmks[].

See also the mlpack documentation and the documentation for mlpack_fastmks[] for more details.

Inputs:

  • R: relation of reference points that tree should be built on
  • N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • bandwidth (Float64): Bandwidth (for Gaussian, Epanechnikov, and triangular kernels). Default 1.
  • base (Float64): Base to use during cover tree construction. Default 2.
  • degree (Float64): Degree of polynomial kernel. Default 2.
  • kernel (String): Kernel type to use: "linear", "polynomial", "cosine", "gaussian", "epanechnikov", "triangular", "hyptan". Default "linear".
  • naive (Bool): If true, O(n^2) naive mode is used for computation. Default false.
  • offset (Float64): Offset of kernel (for polynomial and hyptan kernels). Default 0.
  • scale (Float64): Scale of kernel (for hyptan kernel). Default 1.
  • single (Bool): If true, single-tree search is used (as opposed to dual-tree search). Default false.
  • verbose (Bool): Display informational messages.

Returns:

  • A FastMKS model that can be used in a later call to mlpack_fastmks[].

Definition

@inline def mlpack_fastmks_build[R, N, H] = ext_ml_build[:mlpack_fastmks, R, N, H]

mlpack_fastmks

mlpack_fastmks[K, M, Q, N, H]

Perform max-kernel search on a relation Q containing query points, using a model M that was built with mlpack_fastmks_build[].

See also the mlpack documentation and the documentation for mlpack_fastmks_build[] for more details.

Inputs:

  • K: constant representing number of max kernels to search for.
  • M: pretrained FastMKS model; must be the result of a previous mlpack_fastmks_build[] call.
  • Q: relation of query points; must have the same number of keys as the relation that M was built with.
  • N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • verbose (Bool): Display informational messages.

Returns:

  • A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., kernel), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th max-kernel, and kernel is the kernel value between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_fastmks[K, M, Q, N, H] = ext_ml_transform[:mlpack_fastmks, K, M, Q, N, H]

mlpack_krann_build

mlpack_krann_build[R, N, H]

An implementation of rank-approximate k-nearest-neighbor search (kRANN) using single-tree and dual-tree algorithms. Given a set of reference points and query points, this can build trees that can be used in later calls to mlpack_krann[].

See also the mlpack documentation and the documentation for mlpack_krann[] for more details.

Inputs:

  • R: relation of reference points that tree should be built on
  • N: constant indicating the number of arguments in R that correspond to keys (i.e. dimensions that should not be considered when building the model).
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • first_leaf_exact (Bool): The flag to trigger sampling only after exactly exploring the first leaf. Default false.
  • leaf_size (Int): Leaf size for tree building (used for kd-trees, UB trees, R trees, R* trees, X trees, Hilbert R trees, R+ trees, R++ trees, and octrees). Default 20.
  • naive (Bool): If true, sampling will be done without using a tree. Default false.
  • random_basis (Bool): Before tree-building, project the data onto a random orthogonal basis. Default false.
  • sample_at_leaves (Bool): The flag to trigger sampling at leaves. Default false.
  • seed (Int): Random seed (if 0, std::time(NULL) is used). Default 0.
  • single_mode (Bool): If true, single-tree search is used (as opposed to dual-tree search). Default false.
  • single_sample_limit (Int): The limit on the maximum number of samples (and hence the largest node you can approximate). Default 20.
  • tree_type (String): Type of tree to use: "kd", "ub", "cover", "r", "x", "r-star", "hilbert-r", "r-plus", "r-plus-plus", "oct". Default "kd".
  • verbose (Bool): Display informational messages.

Returns:

  • A rank-approximate KNN model that can be used in a later call to mlpack_krann[].

Definition

@inline def mlpack_krann_build[R, N, H] = ext_ml_build[:mlpack_krann, R, N, H]

mlpack_krann

mlpack_krann[K, M, Q, N, H]

Perform k-rank-approximate-nearest-neighbor search on a relation Q containing query points, using a model M that was built with mlpack_krann_build[].

See also the mlpack documentation and the documentation for mlpack_krann_build[] for more details.

Inputs:

  • K: constant representing number of nearest neighbors to search for.
  • M: pretrained model for kRANN; must be the result of a previous mlpack_krann_build[] call.
  • Q: relation of query points; must have the same number of keys as the relation that M was built with.
  • N: constant indicating the number of arguments in Q that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • alpha (Float64): The desired success probability. Default 0.95.
  • tau (Float64): The allowed rank-error in terms of the percentile of the data. Default 5.
  • verbose (Bool): Display informational messages.

Returns:

  • A relation mapping keys from Q to keys in the reference set that the model M was built on. The form is (query_keys..., k, reference_keys..., distance), where k takes values between 1 and K for each possible set of query_keys.... Given query_keys... and k, reference_keys... is the set of keys associated with the k-th rank-approximate nearest neighbor, and distance is the Euclidean distance between the point associated with query_keys... and the point associated with reference_keys....

Definition

@inline def mlpack_krann[K, M, Q, N, H] = ext_ml_transform[:mlpack_krann, K, M, Q, N, H]

mlpack_det_build

mlpack_det_build[F, N, H]

An implementation of density estimation trees (DETs) for the density estimation task. This can be used to train a density estimation tree for later use with mlpack_det[].

See also the mlpack documentation and the documentation for mlpack_det[] for more details.

Inputs:

  • F: relation of features to build density estimation tree on.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • folds (Int): The number of folds of cross-validation to perform for the estimation (0 is LOOCV). Default 10.
  • max_leaf_size (Int): The maximum size of a leaf in the unpruned, fully grown DET. Default 10.
  • min_leaf_size (Int): The minimum size of a leaf in the unpruned, fully grown DET. Default 5.
  • skip_pruning (Bool): Whether to bypass the pruning process and output the unpruned tree only. Default false.
  • verbose (Bool): Display informational messages. Default false.

Definition

@inline def mlpack_det_build[F, N, H] = ext_ml_build[:mlpack_det, F, N, H]

mlpack_det

mlpack_det[M, F, N, H]

Given a DET trained with mlpack_det_build[], compute densities of the query points in the relation F.

See also the mlpack documentation and the documentation for mlpack_det_build[] for more details.

Inputs:

  • M: pretrained DET model; must be the result of a previous mlpack_det_build[] call.
  • F: relation of features to compute density estimates for.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation mapping keys from F (i.e. the first N elements of the tuples in F) to their density estimates.

Definition

@inline def mlpack_det[M, F, N, H] = ext_ml_transform[:mlpack_det, 0, M, F, N, H]
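
Example (a sketch; the data and hyperparameter values are purely illustrative): build a DET on 1-D training points keyed by an integer id, then estimate densities for two query points:

def train = {(1, 0.1); (2, 0.2); (3, 0.3); (4, 5.0); (5, 5.1)}
def model = mlpack_det_build[train, 1, {("folds", "5")}]
def queries = {(1, 0.15); (2, 5.05)}
def densities = mlpack_det[model, queries, 1, {("verbose", "false")}]

Each tuple of densities has the form (id, density).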

mlpack_nmf

mlpack_nmf[R, F, N, H]

An implementation of non-negative matrix factorization. This can be used to decompose an input dataset into two low-rank non-negative components.

See also the mlpack documentation for more details.

Inputs:

  • R: constant indicating the rank of the low-rank decomposition.
  • F: relation of features to decompose into two low-rank matrices.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • max_iterations (Int): Number of iterations before NMF terminates (0 runs until convergence). Default 10000.
  • min_residue (Float64): The minimum root mean square residue allowed for each iteration, below which the program terminates. Default 1e-05.
  • seed (Int): Random seed. If 0, std::time(NULL) is used. Default 0.
  • update_rules (String): Update rules for each iteration: "multdist", "multdiv", or "als". Default "multdist".
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation encoding both of the low-rank matrices W and H in interleaved form. The first argument is 1 if the tuple describes a row of W and 2 if it describes a row of H. For tuples describing W, the next N arguments are keys from F; for tuples describing H, those N arguments are zeros and should be ignored. The following argument is the (Int) row index of H for tuples describing H (and zero, to be ignored, for tuples describing W). The next argument is the (Int) index of the column in W or H that the tuple pertains to, and the last argument is the (Float64) value at that position.

In some sense, that return format can be understood as an “interleaved sparse representation” of W and H. We are forced to do this in part because Rel cannot currently return two relations easily from one call.

Definition

@inline def mlpack_nmf[R, F, N, H] = ext_ml_transform[:mlpack_nmf, R, {()}, F, N, H]
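
Because the return format is dense to describe, a sketch may help (the data and hyperparameter values are purely illustrative): decompose a relation with one key argument and three feature arguments at rank 2:

def feats = {(1, 1.0, 2.0, 3.0); (2, 2.0, 4.0, 6.0); (3, 3.0, 6.0, 9.0)}
def wh = mlpack_nmf[2, feats, 1, {("seed", "1")}]

Here a tuple (1, key, 0, j, v) gives entry j of the row of W associated with key, and a tuple (2, 0, i, j, v) gives entry j of row i of H.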

mlpack_kernel_pca

mlpack_kernel_pca[D, F, N, H]

An implementation of Kernel Principal Components Analysis (KPCA). This can be used to perform nonlinear dimensionality reduction or preprocessing on a given dataset.

See also the mlpack documentation for more details.

Input options:

  • D: constant indicating the desired new dimensionality of the data.
  • F: relation of features to perform kernel PCA on.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • bandwidth (Float64): Bandwidth, for "gaussian" and "laplacian" kernels. Default 1.
  • center (Bool): If set, the transformed data will be centered about the origin. Default false.
  • degree (Float64): Degree of polynomial, for the "polynomial" kernel. Default 1.
  • kernel (String): The kernel to use; see the linked documentation for the list of usable kernels. Default "gaussian".
  • kernel_scale (Float64): Scale, for "hyptan" kernel. Default 1.
  • nystroem_method (Bool): If set, the Nystroem method will be used. Default false.
  • offset (Float64): Offset, for "hyptan" and "polynomial" kernels. Default 0.
  • sampling (String): Sampling scheme to use for the Nystroem method: "kmeans", "random", "ordered". Default "kmeans".
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation mapping keys in F (i.e. the first N arguments of F) to D values in each dimension.

Definition

@inline def mlpack_kernel_pca[D, F, N, H] =
ext_ml_transform[:mlpack_kernel_pca, D, {()}, F, N, H]

mlpack_pca

mlpack_pca[D, F, N, H]

An implementation of several strategies for principal components analysis (PCA), a common preprocessing step. Given a dataset and a desired new dimensionality, this can reduce the dimensionality of the data using the linear transformation determined by PCA.

See also the mlpack documentation for more details.

Input options:

  • D: constant indicating the desired new dimensionality of the data.
  • F: relation of features to perform PCA on.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • decomposition_method (String): Method used for the principal components analysis: "exact", "randomized", "randomized-block-krylov", "quic". Default "exact".
  • scale (Bool): If set, the data will be scaled before running PCA, such that the variance of each feature is 1. Default false.
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation mapping keys in F (i.e. the first N arguments of F) to D values in each dimension.

Definition

@inline def mlpack_pca[D, F, N, H] = ext_ml_transform[:mlpack_pca, D, {()}, F, N, H]
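
Example (a sketch; the data and hyperparameter values are purely illustrative): reduce three-dimensional points, keyed by an integer id, to two dimensions:

def feats = {(1, 1.0, 2.0, 3.0); (2, 2.0, 3.0, 5.0); (3, 3.0, 5.0, 8.0)}
def reduced = mlpack_pca[2, feats, 1, {("scale", "true")}]

reduced then maps each id to its two principal-component values.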

mlpack_preprocess_split

mlpack_preprocess_split[F, H]

This utility takes a dataset and splits it into a training set and a test set. Before the split, the points in the dataset are randomly reordered. The percentage of the dataset to be used as the test set can be specified with the test_ratio parameter; the default is 0.2 (20%).

See also the mlpack documentation for more details.

Input options:

  • F: relation of features to split. If you want to split labels too, they should be included in this relation.
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • no_shuffle (Bool): Avoid shuffling the data before splitting. Default false.
  • seed (Int): Random seed (0 for std::time(NULL)). Default 0.
  • test_ratio (Float64): Ratio of test set; if not set, the ratio defaults to 0.2.
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation F with membership in the training or test set prepended. So, if (t...) was a tuple in F, (set, t...) will be returned where set is 1 if the point t... is a part of the training set, and set is 2 if the point is a part of the test set.

Definition

@inline def mlpack_preprocess_split[F, H] =
ext_ml_transform[:mlpack_preprocess_split, 0, {()}, F, 0, H]
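
Example (a sketch; the data and hyperparameter values are purely illustrative): split four labeled points into 75% training and 25% test sets:

def data = {(1, 1.0, 0); (2, 2.0, 0); (3, 3.0, 1); (4, 4.0, 1)}
def split = mlpack_preprocess_split[data, {("test_ratio", "0.25"); ("seed", "1")}]

If (t...) is a tuple of data, then split contains (1, t...) when the point lands in the training set and (2, t...) when it lands in the test set.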

mlpack_radical

mlpack_radical[F, N, H]

An implementation of RADICAL, a method for independent component analysis (ICA). Given a dataset, this can decompose the dataset into an independent component matrix; this can be useful for preprocessing.

See also the mlpack documentation for more details.

Input options:

  • F: relation of features to perform RADICAL on.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when transforming the data).
  • H: relation of hyperparameters encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters:

  • angles (Int): Number of angles to consider in brute-force search during Radical2D. Default 150.
  • noise_std_dev (Float64): Standard deviation of Gaussian noise. Default 0.175.
  • objective (Bool): If set, an estimate of the final objective function is printed. Default false.
  • replicates (Int): Number of Gaussian-perturbed replicates to use (per point) in Radical2D. Default 30.
  • sweeps (Int): Number of sweeps; each sweep calls Radical2D once for each pair of dimensions. Default 0.
  • verbose (Bool): Display informational messages. Default false.

Returns:

  • A relation mapping keys in F (i.e. the first N arguments of F) to independent component values in each dimension.

Definition

@inline def mlpack_radical[F, N, H] = ext_ml_transform[:mlpack_radical, 0, {()}, F, N, H]

glm_linear_regression

glm_linear_regression[F, R]

A binding of the GLM.jl function lm. Fits a linear regression model given features F and responses R.

Note that this is unregularized linear regression, so if your model does not converge (e.g. training gives a PosDefException), try using regularized linear regression, perhaps via mlpack_linear_regression[] with the lambda hyperparameter set, or ensure that the columns of your data are not linearly dependent.

Input options:

  • F: Relation of features to perform linear regression on.
  • R: Relation of responses to train the linear regression model on.

Returns:

  • A GLM model that can later be used with glm_predict.

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_linear_regression[features, responses]

Definition

@inline def glm_linear_regression[F, R] = ext_ml_train[:glm_generic, F, R,
{ a, b : (a = "family" and b = "Normal") or (a = "link" and b = "IdentityLink")}
]

glm_logistic_regression

glm_logistic_regression[F, R]

A binding of the GLM.jl function glm with the binomial family and Logit link. Fits a logistic regression model given features F and responses R.

Input options:

  • F: Relation of features to perform logistic regression on.
  • R: Relation of responses to train the logistic regression model on.

Returns:

  • A GLM model that can later be used with glm_predict.

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_logistic_regression[features, responses]

Definition

@inline def glm_logistic_regression[F, R] = ext_ml_train[:glm_generic, F, R,
{ a, b : (a = "family" and b = "Binomial") or (a = "link" and b = "LogitLink")}
]

glm_probit_regression

glm_probit_regression[F, R]

A binding of the GLM.jl function glm with the binomial family and Probit link. Fits a probit regression model given features F and responses R.

Input options:

  • F: Relation of features to perform probit regression on.
  • R: Relation of responses to train the probit regression model on.

Returns:

  • A GLM model that can later be used with glm_predict.

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_probit_regression[features, responses]

Definition

@inline def glm_probit_regression[F, R] = ext_ml_train[:glm_generic, F, R,
{ a, b : (a = "family" and b = "Binomial") or (a = "link" and b = "ProbitLink")}
]

glm_generic

glm_generic[F, R, H]

A binding of the GLM.jl function glm. Fits a generalized linear model given features F and responses R and family and link passed in hyperparameters H. The supported families and links are listed below.

Input options:

  • F: Relation of features to perform a GLM regression on.
  • R: Relation of responses to train the GLM regression model on.
  • H: Relation of hyperparameters to specify the family and link to use to generate the generalized linear model. Example: H = {("family","Normal"); ("link","IdentityLink")}. Families supported: ["Bernoulli", "Binomial", "Gamma", "InverseGaussian", "NegativeBinomial", "Normal", "Poisson"]. Links supported: ["CauchitLink", "CloglogLink", "IdentityLink", "InverseLink", "InverseSquareLink", "LogitLink", "LogLink", "ProbitLink", "SqrtLink"].

Returns:

  • A GLM model that can later be used with glm_predict[].

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def hyperparams = {("family", "NegativeBinomial"); ("link", "LogLink")}
def model = glm_generic[features, responses, hyperparams]

Definition

@inline def glm_generic[F, R, H] = ext_ml_train[:glm_generic, F, R, H]

glm_predict

glm_predict[M, F, N]

A binding of the GLM.jl function predict. Uses a generalized linear model M to generate predictions for features F. Here, M can be produced from any of the definitions glm_linear_regression, glm_logistic_regression, glm_probit_regression, or glm_generic.

Input options:

  • M: Relation containing the model generated by running one of the generalized linear models previously (e.g. glm_linear_regression or glm_generic).
  • F: Relation of features to generate the predictions given the previously computed model.
  • N: constant indicating the number of arguments in F that correspond to keys (i.e. dimensions that should not be considered when computing predictions).

Returns:

  • Predictions of the features F after being fit with the model M.

Example:

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 0); (2, 0); (3, 0); (4, 1); (5, 1)}
def model = glm_probit_regression[features, responses]
def predictions = glm_predict[model, features, 1]

Definition

@inline def glm_predict[M, F, N] = ext_ml_predict[:glm_predict, M, F, N]

xgboost_classifier

xgboost_classifier[F, L, H]

A binding of the xgboost() function to train an XGBoost model (via XGBoost.jl). This fits a boosted tree model with the XGBoost algorithm to the features F and labels L, using hyperparameters specified in the relation H.

If you would like to train a regression model with XGBoost, see xgboost_regressor[].

See also the XGBoost documentation for each hyperparameter.

Note that there are a great many hyperparameters, all of which are optional. Only the most common parameters are documented here; the XGBoost documentation linked above covers each of these in more detail, along with many less common hyperparameters not listed here.

Inputs:

  • F: relation of features to learn on
  • L: relation of labels; the last variable should be the label; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters (incomplete list):

  • num_round (Int): Number of rounds of boosting to perform. (Default 50.)
  • booster (String): Which booster to use. Can be "gbtree", "gblinear", or "dart"; "gbtree" and "dart" use tree-based models, while "gblinear" uses linear functions. (Default "gbtree".)
  • verbosity (Int): Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). (Default 1.)
  • objective (String): Specify the learning task and the corresponding learning objective. Valid options include "binary:logistic", "binary:hinge", "multi:softmax", and other classification objectives listed in the XGBoost documentation. (Default "multi:softmax".)
  • base_score (Float64): The initial prediction score of all instances. (Default 0.5.)
  • eval_metric (String): Evaluation metrics for validation data. Valid choices include "merror", "error", "logloss", "auc", "aucpr", "ndcg", "map", and other classification evaluation metrics specified in the XGBoost documentation. (Default set based on objective value.)
  • seed (Int): Random number seed. (Default 0.)
  • eta (Float64): Step size shrinkage used in update to prevent overfitting. (Default 0.3.)
  • gamma (Float64): Minimum loss reduction required to make a further partition on a leaf node of the tree. (Default 0.0.)
  • max_depth (Int): Maximum depth of a tree. (Default 6.)
  • min_child_weight (Float64): Minimum sum of instance weight (hessian) needed in a child. (Default 1.0.)
  • max_delta_step (Float64): Maximum delta step we allow each leaf output to be. If the value is set to 0, it means there is no constraint. (Default 0.0.)

Definition

@inline def xgboost_classifier[F, L, H] = ext_ml_train[:xgboost_classifier, F, L, H]
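
Example (a minimal sketch; the data and hyperparameter values below are illustrative):

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def labels = {(1, 0); (2, 0); (3, 1); (4, 1); (5, 1)}
def hyperparams = {("num_round", "20"); ("max_depth", "3")}
def model = xgboost_classifier[features, labels, hyperparams]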

xgboost_classifier_predict

xgboost_classifier_predict[M, F, N]

Given an XGBoost classification model trained with xgboost_classifier[], make class predictions on a test set.

For more information, see the documentation for xgboost_classifier[] and the XGBoost documentation.

Inputs:

  • M: XGBoost classification model to use for prediction; must be the result of a previous xgboost_classifier[] call
  • F: relation of test features for which class predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def xgboost_classifier_predict[M, F, N] =
ext_ml_predict[:xgboost_classifier_predict, M, F, N]
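
Example (a minimal sketch; the data and hyperparameters below are illustrative, with N = 1 since features has one key argument):

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def labels = {(1, 0); (2, 0); (3, 1); (4, 1); (5, 1)}
def model = xgboost_classifier[features, labels, {("num_round", "20")}]
def predictions = xgboost_classifier_predict[model, features, 1]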

xgboost_regressor

xgboost_regressor[F, R, H]

A binding of the xgboost() function to train an XGBoost regression model (via XGBoost.jl). This fits a boosted tree model with the XGBoost algorithm to the features F and responses R, using hyperparameters specified in the relation H.

If you would like to train a classification model with XGBoost, see xgboost_classifier[].

See also the XGBoost documentation for each hyperparameter.

Note that there are a great many hyperparameters, all of which are optional. Only the most common parameters are documented here; the XGBoost documentation linked above covers each of these in more detail, along with many less common hyperparameters not listed here.

Inputs:

  • F: relation of features to learn on
  • R: relation of responses; the last variable should be the response; everything else should be keys
  • H: relation of hyperparameters, encoded as (String, String); e.g., {("param1", "10"); ("param2", "true")}

Hyperparameters (incomplete list):

  • num_round (Int): Number of rounds of boosting to perform. (Default 50.)
  • booster (String): Which booster to use. Can be "gbtree", "gblinear", or "dart"; "gbtree" and "dart" use tree-based models, while "gblinear" uses linear functions. (Default "gbtree".)
  • verbosity (Int): Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). (Default 1.)
  • objective (String): Specify the learning task and the corresponding learning objective. Valid options include "reg:squarederror", "reg:squaredlogerror", "reg:logistic", "reg:pseudohubererror", "reg:gamma", "reg:tweedie", and other regression objectives listed in the XGBoost documentation. (Default "reg:squarederror".)
  • base_score (Float64): The initial prediction score of all instances. (Default 0.5.)
  • eval_metric (String): Evaluation metrics for validation data. Valid choices include "rmse", "rmsle", "mae", "mape", "mphe", and other regression evaluation metrics specified in the XGBoost documentation. (Default set based on objective value.)
  • seed (Int): Random number seed. (Default 0.)
  • eta (Float64): Step size shrinkage used in update to prevent overfitting. (Default 0.3.)
  • gamma (Float64): Minimum loss reduction required to make a further partition on a leaf node of the tree. (Default 0.0.)
  • max_depth (Int): Maximum depth of a tree. (Default 6.)
  • min_child_weight (Float64): Minimum sum of instance weight (hessian) needed in a child. (Default 1.0.)
  • max_delta_step (Float64): Maximum delta step we allow each leaf output to be. If the value is set to 0, it means there is no constraint. (Default 0.0.)

Definition

@inline def xgboost_regressor[F, R, H] = ext_ml_train[:xgboost_regressor, F, R, H]
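
Example (a minimal sketch; the data and hyperparameter values below are illustrative):

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 1.1); (2, 1.9); (3, 3.2); (4, 3.8); (5, 5.1)}
def hyperparams = {("num_round", "20"); ("objective", "reg:squarederror")}
def model = xgboost_regressor[features, responses, hyperparams]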

xgboost_regressor_predict

xgboost_regressor_predict[M, F, N]

Given an XGBoost regression model trained with xgboost_regressor[], make regression predictions on a test set.

For more information, see the documentation for xgboost_regressor[] and the XGBoost documentation.

Inputs:

  • M: XGBoost regression model to use for prediction; must be the result of a previous xgboost_regressor[] call
  • F: relation of test features for which regression predictions will be computed
  • N: constant Int representing the number of keys in F

Definition

@inline def xgboost_regressor_predict[M, F, N] =
ext_ml_predict[:xgboost_regressor_predict, M, F, N]
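
Example (a minimal sketch; the data below are illustrative, with N = 1 since features has one key argument):

def features = {(1, 1.0); (2, 2.0); (3, 3.0); (4, 4.0); (5, 5.0)}
def responses = {(1, 1.1); (2, 1.9); (3, 3.2); (4, 3.8); (5, 5.1)}
def model = xgboost_regressor[features, responses, {("num_round", "20")}]
def predictions = xgboost_regressor_predict[model, features, 1]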