% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Familiar.R
\name{train_familiar}
\alias{train_familiar}
\title{Create models using end-to-end machine learning}
\usage{
train_familiar(
  formula = NULL,
  data = NULL,
  experiment_data = NULL,
  cl = NULL,
  experimental_design = "fs+mb",
  learner = NULL,
  hyperparameter = NULL,
  verbose = TRUE,
  ...
)
}
\arguments{
\item{formula}{An R formula. The formula can only contain feature names and
dot (\code{.}). The \code{*} and \code{+1} operators are not supported as these refer to
columns that are not present in the data set.

Use of the formula interface is optional.}

\item{data}{A \code{data.table} object, a \code{data.frame} object, a list containing
multiple \code{data.table} or \code{data.frame} objects, or paths to data files.

\code{data} should be provided if no file paths are provided to the \code{data_files}
argument. If both are provided, only \code{data} will be used.

All data is expected to be in wide format, and ideally has a sample
identifier (see \code{sample_id_column}), batch identifier (see \code{batch_id_column})
and outcome columns (see \code{outcome_column}).

In case paths are provided, the data should be stored as \code{csv}, \code{rds} or
\code{RData} files. See documentation for the \code{data_files} argument for more
information.}

\item{experiment_data}{Experimental data may be provided in the form of}

\item{cl}{Cluster created using the \code{parallel} package. This cluster is then
used to speed up computation through parallelisation. When a cluster is not
provided, parallelisation is performed by setting up a cluster on the local
machine.

This parameter has no effect if the \code{parallel} argument is set to \code{FALSE}.}

\item{experimental_design}{(\strong{required}) Defines what the experiment looks
like, e.g. \code{cv(bt(fs,20)+mb,3,2)} for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building. The basic workflow components are:
\itemize{
\item \code{fs}: (required) feature selection step.
\item \code{mb}: (required) model building step.
\item \code{ev}: (optional) external validation. Setting this is not required for
\code{train_familiar}, but if validation batches or cohorts are present in the
dataset (\code{data}), these should be indicated in the \code{validation_batch_id}
argument.
}

The different components are linked using \code{+}.

Different subsampling methods can be used in conjunction with the basic
workflow components:
\itemize{
\item \code{bs(x,n)}: (stratified) .632 bootstrap, with \code{n} the number of
bootstraps. In contrast to \code{bt}, feature pre-processing parameters and
hyperparameter optimisation are conducted on individual bootstraps.
\item \code{bt(x,n)}: (stratified) .632 bootstrap, with \code{n} the number of
bootstraps. Unlike \code{bs} and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
\item \code{cv(x,n,p)}: (stratified) \code{n}-fold cross-validation, repeated \code{p} times.
Pre-processing parameters are determined for each iteration.
\item \code{lv(x)}: leave-one-out-cross-validation. Pre-processing parameters are
determined for each iteration.
\item \code{ip(x)}: imbalance partitioning for addressing class imbalances on the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the \code{imbalance_correction_method} parameter).
}

As shown in the example above, sampling algorithms can be nested.

The simplest valid experimental design is \code{fs+mb}. This is the default in
\code{train_familiar}, and will create one model for each feature selection
method in \code{fs_method}. To create more models, a subsampling method should
be introduced, e.g. \code{bs(fs+mb,20)} to create 20 models based on bootstraps
of the data.
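
As a minimal, hedged sketch of such a design (not a definitive recipe):
\code{my_data} and its column \code{outcome} are hypothetical, and \code{"mrmr"}
is assumed to be an available feature selection method.
\preformatted{
# Assumes the familiar package is installed. my_data is a hypothetical
# data.frame in wide format with a two-level column named "outcome".
models <- familiar::train_familiar(
  data = my_data,
  experimental_design = "bs(fs+mb,20)",  # feature selection + model building on 20 bootstraps
  learner = "glm_logistic",
  fs_method = "mrmr",                    # assumed method name
  outcome_column = "outcome",
  outcome_type = "binomial"
)
}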

This argument is ignored if the \code{experiment_data} argument is set.}

\item{learner}{(\strong{required}) Name of the learner used to develop a model. A
sizeable number of learners are supported in \code{familiar}. Please see the
vignette on learners for more information concerning the available
learners. Unlike the \code{summon_familiar} function, \code{train_familiar} only
allows for a single learner.}

\item{hyperparameter}{(\emph{optional}) List, or nested list containing
hyperparameters for learners. If a nested list is provided, each sublist
should have the name of the learner method it corresponds to, with list
elements being named after the intended hyperparameter, e.g.
\code{"glm_logistic"=list("sign_size"=3)}.

All learners have hyperparameters. Please refer to the vignette on learners
for more details. If no parameters are provided, sequential model-based
optimisation is used to determine optimal hyperparameters.

Hyperparameters provided by the user are never optimised. However, if more
than one value is provided for a single hyperparameter, optimisation will
be conducted using these values.}

\item{verbose}{Indicates verbosity of the results. Default is TRUE, and all
messages and warnings are returned.}

\item{...}{
  Arguments passed on to \code{\link[=.parse_experiment_settings]{.parse_experiment_settings}}, \code{\link[=.parse_setup_settings]{.parse_setup_settings}}, \code{\link[=.parse_preprocessing_settings]{.parse_preprocessing_settings}}, \code{\link[=.parse_feature_selection_settings]{.parse_feature_selection_settings}}, \code{\link[=.parse_model_development_settings]{.parse_model_development_settings}}, \code{\link[=.parse_hyperparameter_optimisation_settings]{.parse_hyperparameter_optimisation_settings}}
  \describe{
    \item{\code{batch_id_column}}{(\strong{recommended}) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.

In familiar any row of data is organised by four identifiers:
\itemize{
\item The batch identifier \code{batch_id_column}: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
\item The sample identifier \code{sample_id_column}: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
\item The series identifier \code{series_id_column}: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
\item The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
}}
    \item{\code{sample_id_column}}{(\strong{recommended}) Name of the column containing
sample or subject identifiers. See \code{batch_id_column} above for more
details.

If unset, every row will be identified as a single sample.}
    \item{\code{series_id_column}}{(\strong{optional}) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See \code{batch_id_column} above for more details.

If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers.}
    \item{\code{development_batch_id}}{(\emph{optional}) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in \code{validation_batch_id} for external validation.
Required if external validation is performed and \code{validation_batch_id} is
not provided.}
    \item{\code{validation_batch_id}}{(\emph{optional}) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in \code{development_batch_id} for external
validation, or none if not. Required if \code{development_batch_id} is not
provided.}
    \item{\code{outcome_name}}{(\emph{optional}) Name of the modelled outcome. This name will
be used in figures created by \code{familiar}.

If not set, the column name in \code{outcome_column} will be used for
\code{binomial}, \code{multinomial}, \code{count} and \code{continuous} outcomes. For other
outcomes (\code{survival} and \code{competing_risk}) no default is used.}
    \item{\code{outcome_column}}{(\strong{recommended}) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that \code{survival}
and \code{competing_risk} outcomes require two columns that
indicate the time-to-event or the time of last follow-up and the event
status.}
    \item{\code{outcome_type}}{(\strong{recommended}) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
\itemize{
\item \code{binomial}: categorical outcome with 2 levels.
\item \code{multinomial}: categorical outcome with 2 or more levels.
\item \code{count}: Poisson-distributed numeric outcomes.
\item \code{continuous}: general continuous numeric outcomes.
\item \code{survival}: survival outcome for time-to-event data.
}

If not provided, the algorithm will attempt to infer \code{outcome_type} from
the contents of the outcome column. This may lead to unexpected results, and
we therefore advise providing this information manually.

Note that \code{competing_risk} survival analysis is not fully supported, and
is currently not a valid choice for \code{outcome_type}.}
    \item{\code{class_levels}}{(\emph{optional}) Class levels for \code{binomial} or \code{multinomial}
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column.}
    \item{\code{event_indicator}}{(\strong{recommended}) Indicator for events in \code{survival}
and \code{competing_risk} analyses. \code{familiar} will automatically recognise \code{1},
\code{true}, \code{t}, \code{y} and \code{yes} as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values.}
    \item{\code{censoring_indicator}}{(\strong{recommended}) Indicator for right-censoring in
\code{survival} and \code{competing_risk} analyses. \code{familiar} will automatically
recognise \code{0}, \code{false}, \code{f}, \code{n}, \code{no} as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values.}
    \item{\code{competing_risk_indicator}}{(\strong{recommended}) Indicator for competing
risks in \code{competing_risk} analyses. There are no default values, and if
unset, all values other than those specified by the \code{event_indicator} and
\code{censoring_indicator} parameters are considered to indicate competing
risks.}
    \item{\code{signature}}{(\emph{optional}) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features.}
    \item{\code{novelty_features}}{(\emph{optional}) One or more names of feature columns
that should be included for the purpose of novelty detection.}
    \item{\code{exclude_features}}{(\emph{optional}) Feature columns that will be removed
from the data set. Cannot overlap with features in \code{signature},
\code{novelty_features} or \code{include_features}.}
    \item{\code{include_features}}{(\emph{optional}) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with \code{exclude_features}, but may overlap \code{signature}. Features in
\code{signature} and \code{novelty_features} are always included. If both
\code{exclude_features} and \code{include_features} are provided, \code{include_features}
takes precedence, provided that there is no overlap between the two.}
    \item{\code{reference_method}}{(\emph{optional}) Method used to set reference levels for
categorical features. There are several options:
\itemize{
\item \code{auto} (default): Categorical features that are not explicitly set by the
user, i.e. columns containing boolean values or characters, use the most
frequent level as reference. Categorical features that are explicitly set,
i.e. as factors, are used as is.
\item \code{always}: Both automatically detected and user-specified categorical
features have the reference level set to the most frequent level. Ordinal
features are not altered, but are used as is.
\item \code{never}: User-specified categorical features are used as is.
Automatically detected categorical features are simply sorted, and the
first level is then used as the reference level. This was the behaviour
prior to familiar version 1.3.0.
}}
    \item{\code{imbalance_correction_method}}{(\emph{optional}) Type of method used to
address class imbalances. Available options are:
\itemize{
\item \code{full_undersampling} (default): All data will be used in an ensemble
fashion. The full minority class will appear in each partition, but
majority classes are undersampled until all data have been used.
\item \code{random_undersampling}: Randomly undersamples majority classes. This is
useful in cases where full undersampling would lead to the formation of
many models due to major overrepresentation of the largest class.
}

This parameter is only used in combination with imbalance partitioning in
the experimental design, and \code{ip} should therefore appear in the string
that defines the design.}
    \item{\code{imbalance_n_partitions}}{(\emph{optional}) Number of times random
undersampling should be repeated. 10 undersampled subsets with balanced
classes are formed by default.}
    \item{\code{parallel}}{(\emph{optional}) Enable parallel processing. Defaults to \code{TRUE}.
When set to \code{FALSE}, this disables all parallel processing, regardless of
specific parameters such as \code{parallel_preprocessing}. However, when
\code{parallel} is \code{TRUE}, parallel processing of different parts of the
workflow can be disabled by setting respective flags to \code{FALSE}.}
    \item{\code{parallel_nr_cores}}{(\emph{optional}) Number of cores available for
parallelisation. Defaults to 2. This setting does nothing if
parallelisation is disabled.}
    \item{\code{restart_cluster}}{(\emph{optional}) Restart nodes used for parallel computing
to free up memory prior to starting a parallel process. Note that it does
take time to set up the clusters. Therefore setting this argument to \code{TRUE}
may impact processing speed. This argument is ignored if \code{parallel} is
\code{FALSE} or the cluster was initialised outside of familiar. Default is
\code{FALSE}, which causes the clusters to be initialised only once.}
    \item{\code{cluster_type}}{(\emph{optional}) Selection of the cluster type for parallel
processing. Available types are the ones supported by the parallel package
that is part of the base R distribution: \code{psock} (default), \code{fork}, \code{mpi},
\code{nws}, \code{sock}. In addition, \code{none} is available, which also disables
parallel processing.}
    \item{\code{backend_type}}{(\emph{optional}) Selection of the backend for distributing
copies of the data. This backend ensures that only a single master copy is
kept in memory. This limits memory usage during parallel processing.

Several backend options are available, notably \code{socket_server}, and \code{none}
(default). \code{socket_server} is based on the callr package and R sockets,
comes with \code{familiar} and is available for any OS. \code{none} uses the package
environment of familiar to store data, and is available for any OS.
However, \code{none} requires copying of data to any parallel process, and has a
larger memory footprint.}
    \item{\code{server_port}}{(\emph{optional}) Integer indicating the port on which the
socket server or RServe process should communicate. Defaults to port 6311.
Note that ports 0 to 1024 and 49152 to 65535 cannot be used.}
    \item{\code{feature_max_fraction_missing}}{(\emph{optional}) Numeric value between \code{0.0}
and \code{0.95} that determines the maximum fraction of missing values that
still allows a feature to be included in the data set. All features with a
missing value fraction over this threshold are not processed further. The
default value is \code{0.30}.}
    \item{\code{sample_max_fraction_missing}}{(\emph{optional}) Numeric value between \code{0.0}
and \code{0.95} that determines the maximum fraction of missing values that
still allows a sample to be included in the data set. All samples with a
missing value fraction over this threshold are excluded and not processed
further. The default value is \code{0.30}.}
    \item{\code{filter_method}}{(\emph{optional}) One or more methods used to reduce
dimensionality of the data set by removing irrelevant or poorly
reproducible features.

Several methods are available:
\itemize{
\item \code{none} (default): None of the features will be filtered.
\item \code{low_variance}: Features with a variance below the
\code{low_var_minimum_variance_threshold} are filtered. This can be useful to
filter, for example, genes that are not differentially expressed.
\item \code{univariate_test}: Features undergo a univariate regression using an
outcome-appropriate regression model. The p-value of the model coefficient
is collected. Features with a coefficient p-value or q-value above the
\code{univariate_test_threshold} are subsequently filtered.
\item \code{robustness}: Features that are not sufficiently robust according to the
intraclass correlation coefficient are filtered. Use of this method
requires that repeated measurements are present in the data set, i.e. there
should be entries for which the sample and cohort identifiers are the same.
}

More than one method can be used simultaneously. Features with singular
values are always filtered, as these do not contain information.}
    \item{\code{univariate_test_threshold}}{(\emph{optional}) Numeric value between \code{0.0} and
\code{1.0} that determines which features are irrelevant and will be filtered by
the \code{univariate_test}. The p or q-values are compared to this threshold.
All features with values above the threshold are filtered. The default
value is \code{0.20}.}
    \item{\code{univariate_test_threshold_metric}}{(\emph{optional}) Metric that is
compared against the \code{univariate_test_threshold}. The following metrics can
be chosen:
\itemize{
\item \code{p_value} (default): The unadjusted p-value of each feature is used
to filter features.
\item \code{q_value}: The q-value (Storey, 2002) is used to filter features. Some
data sets may have insufficient samples to compute the q-value. The
\code{qvalue} package must be installed from Bioconductor to use this method.
}}
    \item{\code{univariate_test_max_feature_set_size}}{(\emph{optional}) Maximum size of the
feature set after the univariate test. P or q values of features are
compared against the threshold, but if the resulting data set would be
larger than this setting, only the most relevant features up to the desired
feature set size are selected.

The default value is \code{NULL}, which causes features to be filtered based on
their relevance only.}
    \item{\code{low_var_minimum_variance_threshold}}{(required, if used) Numeric value
that determines which features will be filtered by the \code{low_variance}
method. The variance of each feature is computed and compared to the
threshold. If it is below the threshold, the feature is removed.

This parameter has no default value and should be set if \code{low_variance} is
used.}
    \item{\code{low_var_max_feature_set_size}}{(\emph{optional}) Maximum size of the feature
set after filtering features with a low variance. All features are first
compared against \code{low_var_minimum_variance_threshold}. If the resulting
feature set would be larger than specified, only the most strongly varying
features will be selected, up to the desired size of the feature set.

The default value is \code{NULL}, which causes features to be filtered based on
their variance only.}
    \item{\code{robustness_icc_type}}{(\emph{optional}) String indicating the type of
intraclass correlation coefficient (\code{1}, \code{2} or \code{3}) that should be used to
compute robustness for features in repeated measurements. These types
correspond to the types in Shrout and Fleiss (1979). The default value is
\code{1}.}
    \item{\code{robustness_threshold_metric}}{(\emph{optional}) String indicating which
specific intraclass correlation coefficient (ICC) metric should be used to
filter features. This should be one of:
\itemize{
\item \code{icc}: The estimated ICC value itself.
\item \code{icc_low} (default): The estimated lower limit of the 95\% confidence
interval of the ICC, as suggested by Koo and Li (2016).
\item \code{icc_panel}: The estimated ICC value over the panel average, i.e. the ICC
that would be obtained if all repeated measurements were averaged.
\item \code{icc_panel_low}: The estimated lower limit of the 95\% confidence interval
of the panel ICC.
}}
    \item{\code{robustness_threshold_value}}{(\emph{optional}) The intraclass correlation
coefficient value that is used as threshold. The default value is \code{0.70}.}
    \item{\code{transformation_method}}{(\emph{optional}) The transformation method used to
change the distribution of the data to be more normal-like. The following
methods are available:
\itemize{
\item \code{none}: This disables transformation of features.
\item \code{yeo_johnson} (default): Transformation using the Yeo-Johnson
transformation (Yeo and Johnson, 2000). The algorithm tests various lambda
values (-2.0, -1.0, -0.5, 0.0, 0.33333, 0.5, 1.0, 1.5, 2.0) and selects the
lambda that maximises the log-likelihood.
\item \code{yeo_johnson_trim}: As \code{yeo_johnson}, but based on the set of feature
values where the 5\% lowest and 5\% highest values are discarded. This
reduces the effect of outliers.
\item \code{yeo_johnson_winsor}: As \code{yeo_johnson}, but based on the set of feature
values where the 5\% lowest and 5\% highest values are winsorised. This
reduces the effect of outliers.
\item \code{box_cox}: Transformation using the Box-Cox transformation (Box and Cox,
1964). Unlike the Yeo-Johnson transformation, the Box-Cox transformation
requires that all data are positive. Features that contain zero or negative
values cannot be transformed using this transformation. The algorithm tests
various lambda values (-2.0, -1.0, -0.5, 0.0, 0.3333, 0.5, 1.0, 1.5, 2.0)
and selects the lambda that maximises the log-likelihood.
\item \code{box_cox_trim}: As \code{box_cox}, but based on the set of feature values
where the 5\% lowest and 5\% highest values are discarded. This reduces the
effect of outliers.
\item \code{box_cox_winsor}: As \code{box_cox}, but based on the set of feature values
where the 5\% lowest and 5\% highest values are winsorised. This reduces the
effect of outliers.
}
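
For reference, the two transformation families can be written as follows for
a feature value \eqn{x} and parameter \eqn{\lambda}. These are the standard
definitions (Box and Cox, 1964; Yeo and Johnson, 2000), given here as a sketch.
Any shifting or scaling that \code{familiar} may apply internally is not shown.
Box-Cox, for \eqn{x > 0}:
\deqn{x' = (x^{\lambda} - 1) / \lambda \quad (\lambda \neq 0), \qquad x' = \log(x) \quad (\lambda = 0)}
Yeo-Johnson, for \eqn{x \geq 0}:
\deqn{x' = ((x + 1)^{\lambda} - 1) / \lambda \quad (\lambda \neq 0), \qquad x' = \log(x + 1) \quad (\lambda = 0)}
Yeo-Johnson, for \eqn{x < 0}:
\deqn{x' = -((1 - x)^{2 - \lambda} - 1) / (2 - \lambda) \quad (\lambda \neq 2), \qquad x' = -\log(1 - x) \quad (\lambda = 2)}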

Only features that contain numerical data are transformed. Transformation
parameters obtained in development data are stored within \code{featureInfo}
objects for later use with validation data sets.}
    \item{\code{normalisation_method}}{(\emph{optional}) The normalisation method used to
improve the comparability between numerical features that may have very
different scales. The following normalisation methods can be chosen:
\itemize{
\item \code{none}: This disables feature normalisation.
\item \code{standardisation} (default): Features are normalised by subtraction of
their mean values and division by their standard deviations. This causes
every feature to have a center value of 0.0 and standard deviation of
1.0.
\item \code{standardisation_trim}: As \code{standardisation}, but based on the set of
feature values where the 5\% lowest and 5\% highest values are discarded.
This reduces the effect of outliers.
\item \code{standardisation_winsor}: As \code{standardisation}, but based on the set of
feature values where the 5\% lowest and 5\% highest values are winsorised.
This reduces the effect of outliers.
\item \code{normalisation}: Features are normalised by subtraction of their minimum
values and division by their ranges. This maps all feature values to a
\eqn{[0, 1]} interval.
\item \code{normalisation_trim}: As \code{normalisation}, but based on the set of feature
values where the 5\% lowest and 5\% highest values are discarded. This
reduces the effect of outliers.
\item \code{normalisation_winsor}: As \code{normalisation}, but based on the set of
feature values where the 5\% lowest and 5\% highest values are winsorised.
This reduces the effect of outliers.
\item \code{quantile}: Features are normalised by subtraction of their median values
and division by their interquartile range.
\item \code{mean_centering}: Features are centered by subtracting the mean, but do
not undergo rescaling.
}
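
As a sketch, the main variants can be written as follows for a numeric feature
value \eqn{x}. The symbols are introduced here for illustration only:
\eqn{\mu} is the mean, \eqn{\sigma} the standard deviation, \eqn{x_{min}} and
\eqn{x_{max}} the minimum and maximum, \eqn{m} the median and \eqn{IQR} the
interquartile range of the feature.
\deqn{x'_{standardisation} = (x - \mu) / \sigma, \qquad x'_{normalisation} = (x - x_{min}) / (x_{max} - x_{min})}
\deqn{x'_{quantile} = (x - m) / IQR, \qquad x'_{centering} = x - \mu}
For the \code{_trim} and \code{_winsor} variants, the same expressions apply,
but the parameters are estimated after the 5\% lowest and 5\% highest values
have been discarded or winsorised, respectively.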

Only features that contain numerical data are normalised. Normalisation
parameters obtained in development data are stored within \code{featureInfo}
objects for later use with validation data sets.}
    \item{\code{batch_normalisation_method}}{(\emph{optional}) The method used for batch
normalisation. Available methods are:
\itemize{
\item \code{none} (default): This disables batch normalisation of features.
\item \code{standardisation}: Features within each batch are normalised by
subtraction of the mean value and division by the standard deviation in
each batch.
\item \code{standardisation_trim}: As \code{standardisation}, but based on the set of
feature values where the 5\% lowest and 5\% highest values are discarded.
This reduces the effect of outliers.
\item \code{standardisation_winsor}: As \code{standardisation}, but based on the set of
feature values where the 5\% lowest and 5\% highest values are winsorised.
This reduces the effect of outliers.
\item \code{normalisation}: Features within each batch are normalised by subtraction
of their minimum values and division by their range in each batch. This
maps all feature values in each batch to a \eqn{[0, 1]} interval.
\item \code{normalisation_trim}: As \code{normalisation}, but based on the set of feature
values where the 5\% lowest and 5\% highest values are discarded. This
reduces the effect of outliers.
\item \code{normalisation_winsor}: As \code{normalisation}, but based on the set of
feature values where the 5\% lowest and 5\% highest values are winsorised.
This reduces the effect of outliers.
\item \code{quantile}: Features in each batch are normalised by subtraction of the
median value and division by the interquartile range of each batch.
\item \code{mean_centering}: Features in each batch are centered on 0.0 by
subtracting the mean value in each batch, but are not rescaled.
\item \code{combat_parametric}: Batch adjustments using parametric empirical Bayes
(Johnson et al, 2007). \code{combat_p} leads to the same method.
\item \code{combat_non_parametric}: Batch adjustments using non-parametric empirical
Bayes (Johnson et al, 2007). \code{combat_np} and \code{combat} lead to the same
method. Note that we reduced complexity from O(\eqn{n^2}) to O(\eqn{n}) by
only computing batch adjustment parameters for each feature on a subset of
50 randomly selected features, instead of all features.
}

Only features that contain numerical data are normalised using batch
normalisation. Batch normalisation parameters obtained in development data
are stored within \code{featureInfo} objects for later use with validation data
sets, in case the validation data is from the same batch.

If validation data contains data from unknown batches, normalisation
parameters are separately determined for these batches.

Note that for both empirical Bayes methods, the batch effect is assumed to
act similarly across the features. This is often true for things such as
gene expressions, but the assumption may not hold generally.

When performing batch normalisation, it is moreover important to check that
differences between batches or cohorts are not related to the studied
endpoint.}
    \item{\code{imputation_method}}{(\emph{optional}) Method used for imputing missing
feature values. Two methods are implemented:
\itemize{
\item \code{simple}: Simple replacement of a missing value by the median value (for
numeric features) or the modal value (for categorical features).
\item \code{lasso}: Imputation of missing value by lasso regression (using \code{glmnet})
based on information contained in other features.
}

\code{simple} imputation precedes \code{lasso} imputation to ensure that any missing
values in predictors required for \code{lasso} regression are resolved. The
\code{lasso} estimate is then used to replace the missing value.

The default value depends on the number of features in the dataset. If the
number is lower than 100, \code{lasso} is used by default, and \code{simple}
otherwise.

Only single imputation is performed. Imputation models and parameters are
stored within \code{featureInfo} objects for later use with validation data
sets.}
    \item{\code{cluster_method}}{(\emph{optional}) Clustering is performed to identify and
replace redundant features, for example those that are highly correlated.
Such features do not carry much additional information and may be removed
or replaced instead (Park et al., 2007; Tolosi and Lengauer, 2011).

The cluster method determines the algorithm used to form the clusters. The
following cluster methods are implemented:
\itemize{
\item \code{none}: No clustering is performed.
\item \code{hclust} (default): Hierarchical agglomerative clustering. If the
\code{fastcluster} package is installed, \code{fastcluster::hclust} is used (Muellner
2013), otherwise \code{stats::hclust} is used.
\item \code{agnes}: Hierarchical clustering using agglomerative nesting (Kaufman and
Rousseeuw, 1990). This algorithm is similar to \code{hclust}, but uses the
\code{cluster::agnes} implementation.
\item \code{diana}: Divisive analysis hierarchical clustering. This method uses
divisive instead of agglomerative clustering (Kaufman and Rousseeuw, 1990).
\code{cluster::diana} is used.
\item \code{pam}: Partitioning around medoids. This partitions the data into \eqn{k}
clusters around medoids (Kaufman and Rousseeuw, 1990). \eqn{k} is selected
using the \code{silhouette} metric. \code{pam} is implemented using the
\code{cluster::pam} function.
}

Clusters and cluster information are stored within \code{featureInfo} objects for
later use with validation data sets. This enables reproduction of the same
clusters as formed in the development data set.}
    \item{\code{cluster_linkage_method}}{(\emph{optional}) Linkage method used for
agglomerative clustering in \code{hclust} and \code{agnes}. The following linkage
methods can be used:
\itemize{
\item \code{average} (default): Average linkage.
\item \code{single}: Single linkage.
\item \code{complete}: Complete linkage.
\item \code{weighted}: Weighted linkage, also known as McQuitty linkage.
\item \code{ward}: Linkage using Ward's minimum variance method.
}

\code{diana} and \code{pam} do not require a linkage method.}
    \item{\code{cluster_cut_method}}{(\emph{optional}) The method used to define the actual
clusters. The following methods can be used:
\itemize{
\item \code{silhouette}: Clusters are formed based on the silhouette score
(Rousseeuw, 1987). The average silhouette score is computed from 2 to
\eqn{n} clusters, with \eqn{n} the number of features. Clusters are only
formed if the average silhouette exceeds 0.50, which indicates reasonable
evidence for structure. This procedure may be slow if the number of
features is large (>100s).
\item \code{fixed_cut}: Clusters are formed by cutting the hierarchical tree at the
point indicated by the \code{cluster_similarity_threshold}, e.g. where features
in a cluster have an average Spearman correlation of 0.90. \code{fixed_cut} is
only available for \code{agnes}, \code{diana} and \code{hclust}.
\item \code{dynamic_cut}: Dynamic cluster formation using the cutting algorithm in
the \code{dynamicTreeCut} package. This package should be installed to select
this option. \code{dynamic_cut} can only be used with \code{agnes} and \code{hclust}.
}

The default options are \code{silhouette} for partitioning around medoids (\code{pam})
and \code{fixed_cut} otherwise.}
    \item{\code{cluster_similarity_metric}}{(\emph{optional}) Clusters are formed based on
feature similarity. All features are compared in a pair-wise fashion to
compute similarity, for example correlation. The resulting similarity grid
is converted into a distance matrix that is subsequently used for
clustering. The following metrics are supported to compute pairwise
similarities:
\itemize{
\item \code{mutual_information} (default): normalised mutual information.
\item \code{mcfadden_r2}: McFadden's pseudo R-squared (McFadden, 1974).
\item \code{cox_snell_r2}: Cox and Snell's pseudo R-squared (Cox and Snell, 1989).
\item \code{nagelkerke_r2}: Nagelkerke's pseudo R-squared (Nagelkerke, 1991).
\item \code{spearman}: Spearman's rank order correlation.
\item \code{kendall}: Kendall rank correlation.
\item \code{pearson}: Pearson product-moment correlation.
}

The pseudo R-squared metrics can be used to assess similarity between mixed
pairs of numeric and categorical features, as these are based on the
log-likelihood of regression models. In \code{familiar}, the more informative
feature is used as the predictor and the other feature as the response
variable. In numeric-categorical pairs, the numeric feature is considered
to be more informative and is thus used as the predictor. In
categorical-categorical pairs, the feature with most levels is used as the
predictor.

In case any of the classical correlation coefficients (\code{pearson},
\code{spearman} and \code{kendall}) are used with (mixed) categorical features, the
categorical features are one-hot encoded and the mean correlation over all
resulting pairs is used as similarity.}
    \item{\code{cluster_similarity_threshold}}{(\emph{optional}) The threshold level for
pair-wise similarity that is required to form clusters using \code{fixed_cut}.
This should be a numerical value between 0.0 and 1.0. Note, however, that a
reasonable threshold value depends strongly on the similarity metric. The
following are the default values used:
\itemize{
\item \code{mcfadden_r2} and \code{mutual_information}: \code{0.30}
\item \code{cox_snell_r2} and \code{nagelkerke_r2}: \code{0.75}
\item \code{spearman}, \code{kendall} and \code{pearson}: \code{0.90}
}

Alternatively, if the \code{fixed_cut} method is not used, this value determines
whether any clustering should be performed, because the data may not
contain highly similar features. The default values in this situation are:
\itemize{
\item \code{mcfadden_r2} and \code{mutual_information}: \code{0.25}
\item \code{cox_snell_r2} and \code{nagelkerke_r2}: \code{0.40}
\item \code{spearman}, \code{kendall} and \code{pearson}: \code{0.70}
}

The threshold value is converted to a distance (1-similarity) prior to
cutting hierarchical trees.}
    \item{\code{cluster_representation_method}}{(\emph{optional}) Method used to determine
how the information of co-clustered features is summarised and used to
represent the cluster. The following methods can be selected:
\itemize{
\item \code{best_predictor} (default): The feature with the highest importance
according to univariate regression with the outcome is used to represent
the cluster.
\item \code{medioid}: The feature closest to the cluster center, i.e. the feature
that is most similar to the remaining features in the cluster, is used to
represent the cluster.
\item \code{mean}: A meta-feature is generated by averaging the feature values for
all features in a cluster. This method aligns all features so that all
features will be positively correlated prior to averaging. Should a cluster
contain one or more categorical features, the \code{medioid} method will be used
instead, as averaging is not possible. Note that if this method is chosen,
the \code{normalisation_method} parameter should be one of \code{standardisation},
\code{standardisation_trim}, \code{standardisation_winsor} or \code{quantile}.
}

If the \code{pam} cluster method is selected, only the \code{medioid} method can be
used. In that case one medoid is used by default.}
    \item{\code{parallel_preprocessing}}{(\emph{optional}) Enable parallel processing for the
preprocessing workflow. Defaults to \code{TRUE}. When set to \code{FALSE}, this will
disable the use of parallel processing while preprocessing, regardless of
the settings of the \code{parallel} parameter. \code{parallel_preprocessing} is
ignored if \code{parallel=FALSE}.}
    \item{\code{fs_method}}{(\strong{required}) Feature selection method to be used for
determining variable importance. \code{familiar} implements various feature
selection methods. Please refer to the vignette on feature selection
methods for more details.

More than one feature selection method can be chosen. The experiment will
then be repeated for each feature selection method.

Feature selection methods determine the ranking of features. Actual
selection of features is done by optimising the signature size model
hyperparameter during the hyperparameter optimisation step.}
    \item{\code{fs_method_parameter}}{(\emph{optional}) List of lists containing parameters
for feature selection methods. Each sublist should have the name of the
feature selection method it corresponds to.

Most feature selection methods do not have parameters that can be set.
Please refer to the vignette on feature selection methods for more details.
Note that if the feature selection method is based on a learner (e.g. lasso
regression), hyperparameter optimisation may be performed prior to
assessing variable importance.}
    \item{\code{vimp_aggregation_method}}{(\emph{optional}) The method used to aggregate
variable importances over different data subsets, e.g. bootstraps. The
following methods can be selected:
\itemize{
\item \code{none}: Don't aggregate ranks, but rather aggregate the variable
importance scores themselves.
\item \code{mean}: Use the mean rank of a feature over the subsets to
determine the aggregated feature rank.
\item \code{median}: Use the median rank of a feature over the subsets to determine
the aggregated feature rank.
\item \code{best}: Use the best rank the feature obtained in any subset to determine
the aggregated feature rank.
\item \code{worst}: Use the worst rank the feature obtained in any subset to
determine the aggregated feature rank.
\item \code{stability}: Use the frequency of the feature being in the subset of
highly ranked features as measure for the aggregated feature rank
(Meinshausen and Buehlmann, 2010).
\item \code{exponential}: Use a rank-weighted frequency of occurrence in the subset
of highly ranked features as measure for the aggregated feature rank (Haury
et al., 2011).
\item \code{borda} (default): Use the Borda count as measure for the aggregated
feature rank (Wald et al., 2012).
\item \code{enhanced_borda}: Use an occurrence frequency-weighted Borda count as
measure for the aggregated feature rank (Wald et al., 2012).
\item \code{truncated_borda}: Use the Borda count computed only on features within
the subset of highly ranked features.
\item \code{enhanced_truncated_borda}: Apply both the enhanced Borda method and the
truncated Borda method and use the resulting Borda count as the aggregated
feature rank.
}

The \emph{feature selection methods} vignette provides additional information.}
    \item{\code{vimp_aggregation_rank_threshold}}{(\emph{optional}) The threshold used to
define the subset of highly important features. If not set, this threshold
is determined by maximising the variance in the occurrence value over all
features over the subset size.

This parameter is only relevant for \code{stability}, \code{exponential},
\code{enhanced_borda}, \code{truncated_borda} and \code{enhanced_truncated_borda} methods.}
    \item{\code{parallel_feature_selection}}{(\emph{optional}) Enable parallel processing for
the feature selection workflow. Defaults to \code{TRUE}. When set to \code{FALSE},
this will disable the use of parallel processing while performing feature
selection, regardless of the settings of the \code{parallel} parameter.
\code{parallel_feature_selection} is ignored if \code{parallel=FALSE}.}
    \item{\code{novelty_detector}}{(\emph{optional}) Specify the algorithm used for training
a novelty detector. This detector can be used to identify
out-of-distribution data prospectively.}
    \item{\code{detector_parameters}}{(\emph{optional}) List of lists containing hyperparameters
for novelty detectors. Currently not used.}
    \item{\code{parallel_model_development}}{(\emph{optional}) Enable parallel processing for
the model development workflow. Defaults to \code{TRUE}. When set to \code{FALSE},
this will disable the use of parallel processing while developing models,
regardless of the settings of the \code{parallel} parameter.
\code{parallel_model_development} is ignored if \code{parallel=FALSE}.}
    \item{\code{optimisation_bootstraps}}{(\emph{optional}) Number of bootstraps that should
be generated from the development data set. During the optimisation
procedure one or more of these bootstraps (indicated by
\code{smbo_step_bootstraps}) are used for model development using different
combinations of hyperparameters. The effect of the hyperparameters is then
assessed by comparing in-bag and out-of-bag model performance.

The default number of bootstraps is \code{50}. Hyperparameter optimisation may
finish before exhausting the set of bootstraps.}
    \item{\code{optimisation_determine_vimp}}{(\emph{optional}) Logical value that indicates
whether variable importance is determined separately for each of the
bootstraps created during the optimisation process (\code{TRUE}) or the
applicable results from the feature selection step are used (\code{FALSE}).

Determining variable importance increases the initial computational
overhead. However, it prevents positive biases for the out-of-bag data due
to overlap of these data with the development data set used for the feature
selection step. In this case, any hyperparameters of the variable
importance method are not determined separately for each bootstrap, but
those obtained during the feature selection step are used instead. In case
multiple of such hyperparameter sets could be applicable, the set that will
be used is randomly selected for each bootstrap.

This parameter only affects hyperparameter optimisation of learners. The
default is \code{TRUE}.}
    \item{\code{smbo_random_initialisation}}{(\emph{optional}) String indicating the
initialisation method for the hyperparameter space. Can be one of
\code{fixed_subsample} (default), \code{fixed}, or \code{random}. \code{fixed} and
\code{fixed_subsample} first create hyperparameter sets from a range of default
values set by familiar. \code{fixed_subsample} then randomly draws up to
\code{smbo_n_random_sets} from the grid. \code{random} does not rely upon a fixed
grid, and randomly draws up to \code{smbo_n_random_sets} hyperparameter sets
from the hyperparameter space.}
    \item{\code{smbo_n_random_sets}}{(\emph{optional}) Number of random or subsampled
hyperparameters drawn during the initialisation process. Default: \code{100}.
Cannot be smaller than \code{10}. The parameter is not used when
\code{smbo_random_initialisation} is \code{fixed}, as the entire pre-defined grid
will be explored.}
    \item{\code{max_smbo_iterations}}{(\emph{optional}) Maximum number of intensify
iterations of the SMBO algorithm. During an intensify iteration a run-off
occurs between the current \emph{best} hyperparameter combination and either the 10
challenger combinations with the highest expected improvement or a set of 20
random combinations.

Run-off with random combinations is used to force exploration of the
hyperparameter space, and is performed every second intensify iteration, or
if there is no expected improvement for any challenger combination.

If a combination of hyperparameters leads to better performance on the same
data than the incumbent \emph{best} set of hyperparameters, it replaces the
incumbent set at the end of the intensify iteration.

The default number of intensify iterations is \code{20}. Iterations may be
stopped early if the incumbent set of hyperparameters remains the same for
\code{smbo_stop_convergent_iterations} iterations, or performance improvement is
minimal. This behaviour is suppressed during the first 4 iterations to
enable the algorithm to explore the hyperparameter space.}
    \item{\code{smbo_stop_convergent_iterations}}{(\emph{optional}) The number of subsequent
convergent SMBO iterations required to stop hyperparameter optimisation
early. An iteration is convergent if the \emph{best} parameter set has not
changed or the optimisation score over the 4 most recent iterations has not
changed beyond the tolerance level in \code{smbo_stop_tolerance}.

The default value is \code{3}.}
    \item{\code{smbo_stop_tolerance}}{(\emph{optional}) Tolerance for early stopping due to
convergent optimisation score.

The default value depends on the square root of the number of samples (at
the series level) and is computed as \code{0.1 * 1 / sqrt(n_samples)}, which
yields \code{0.01} for 100 samples. The tolerance has a lower limit of
\code{0.0001}, which is reached at 1M or more samples.}
    \item{\code{smbo_time_limit}}{(\emph{optional}) Time limit (in minutes) for the
optimisation process. Optimisation is stopped after this limit is exceeded.
Time taken to determine variable importance for the optimisation process
(see the \code{optimisation_determine_vimp} parameter) does not count.

The default is \code{NULL}, indicating that there is no time limit for the
optimisation process. The time limit cannot be less than 1 minute.}
    \item{\code{smbo_initial_bootstraps}}{(\emph{optional}) The number of bootstraps taken
from the set of \code{optimisation_bootstraps} as the bootstraps assessed
initially.

The default value is \code{1}. The value cannot be larger than
\code{optimisation_bootstraps}.}
    \item{\code{smbo_step_bootstraps}}{(\emph{optional}) The number of bootstraps taken from
the set of \code{optimisation_bootstraps} bootstraps as the bootstraps assessed
during the steps of each intensify iteration.

The default value is \code{3}. The value cannot be larger than
\code{optimisation_bootstraps}.}
    \item{\code{smbo_intensify_steps}}{(\emph{optional}) The number of steps in each SMBO
intensify iteration. In each step, a new set of \code{smbo_step_bootstraps}
bootstraps is drawn and used in the run-off between the incumbent \emph{best}
hyperparameter combination and its challengers.

The default value is \code{5}. Higher numbers allow for a more detailed
comparison, but this comes with added computational cost.}
    \item{\code{optimisation_metric}}{(\emph{optional}) One or more metrics used to compute
performance scores. See the vignette on performance metrics for the
available metrics.

If unset, the following metrics are used by default:
\itemize{
\item \code{auc_roc}: For \code{binomial} and \code{multinomial} models.
\item \code{mse}: Mean squared error for \code{continuous} models.
\item \code{msle}: Mean squared logarithmic error for \code{count} models.
\item \code{concordance_index}: For \code{survival} models.
}

Multiple optimisation metrics can be specified. Actual metric values are
converted to an objective value by comparison with a baseline metric value
that derives from a trivial model, i.e. majority class for binomial and
multinomial outcomes, the median outcome for count and continuous outcomes
and a fixed risk or time for survival outcomes.}
    \item{\code{optimisation_function}}{(\emph{optional}) Type of optimisation function used
to quantify the performance of a hyperparameter set. Model performance is
assessed using the metric(s) specified by \code{optimisation_metric} on the
in-bag (IB) and out-of-bag (OOB) samples of a bootstrap. These values are
converted to objective scores with a standardised interval of
\eqn{[-1.0, 1.0]}. Each pair of objective scores is subsequently used to
compute an optimisation score. The optimisation score across different
bootstraps is then aggregated to a summary score. This summary score is used
to rank hyperparameter sets, and select the optimal set.

The combination of optimisation score and summary score is determined by
the optimisation function indicated by this parameter:
\itemize{
\item \code{validation} or \code{max_validation} (default): seeks to maximise OOB score.
\item \code{balanced}: seeks to balance IB and OOB score.
\item \code{stronger_balance}: similar to \code{balanced}, but with stronger penalty for
differences between IB and OOB scores.
\item \code{validation_minus_sd}: seeks to optimise the average OOB score minus its
standard deviation.
\item \code{validation_25th_percentile}: seeks to optimise the 25th percentile of
OOB scores, and is conceptually similar to \code{validation_minus_sd}.
\item \code{model_estimate}: seeks to maximise the OOB score estimate predicted by
the hyperparameter learner (not available for random search).
\item \code{model_estimate_minus_sd}: seeks to maximise the OOB score estimate minus
its estimated standard deviation, as predicted by the hyperparameter
learner (not available for random search).
\item \code{model_balanced_estimate}: seeks to maximise the estimate of the balanced
IB and OOB score. This is similar to the \code{balanced} score, and in fact uses
a hyperparameter learner to predict said score (not available for random
search).
\item \code{model_balanced_estimate_minus_sd}: seeks to maximise the estimate of the
balanced IB and OOB score, minus its estimated standard deviation. This is
similar to the \code{balanced} score, but takes into account its estimated
spread.
}

Additional details are provided in the \emph{Learning algorithms and
hyperparameter optimisation} vignette.}
    \item{\code{hyperparameter_learner}}{(\emph{optional}) Any point in the hyperparameter
space has a single, scalar, optimisation score value that is \emph{a priori}
unknown. During the optimisation process, the algorithm samples from the
hyperparameter space by selecting hyperparameter sets and computing the
optimisation score value for one or more bootstraps. For each
hyperparameter set the resulting values are distributed around the actual
value. The learner indicated by \code{hyperparameter_learner} is then used to
infer optimisation score estimates for unsampled parts of the
hyperparameter space.

The following models are available:
\itemize{
\item \code{bayesian_additive_regression_trees} or \code{bart}: Uses Bayesian Additive
Regression Trees (Sparapani et al., 2021) for inference. Unlike standard
random forests, BART allows for estimating posterior distributions directly
and can extrapolate.
\item \code{gaussian_process} (default): Creates a localised approximate Gaussian
process for inference (Gramacy, 2016). This allows for better scaling than
deterministic Gaussian Processes.
\item \code{random_forest}: Creates a random forest for inference. Originally
suggested by Hutter et al. (2011). A weakness of random forests is their
lack of extrapolation beyond observed values, which limits their usefulness
in exploiting promising areas of hyperparameter space.
\item \code{random} or \code{random_search}: Forgoes the use of models to steer
optimisation. Instead, a random search is performed.
}}
    \item{\code{acquisition_function}}{(\emph{optional}) The acquisition function influences
how new hyperparameter sets are selected. The algorithm uses the model
learned by the learner indicated by \code{hyperparameter_learner} to search the
hyperparameter space for hyperparameter sets that are either likely better
than the best known set (\emph{exploitation}) or where there is considerable
uncertainty (\emph{exploration}). The acquisition function quantifies this
(Shahriari et al., 2016).

The following acquisition functions are available, and are described in
more detail in the \emph{learner algorithms} vignette:
\itemize{
\item \code{improvement_probability}: The probability of improvement quantifies the
probability that the expected optimisation score for a set is better than
the best observed optimisation score.
\item \code{improvement_empirical_probability}: Similar to
\code{improvement_probability}, but based directly on optimisation scores
predicted by the individual decision trees.
\item \code{expected_improvement} (default): Computes expected improvement.
\item \code{upper_confidence_bound}: This acquisition function is based on the upper
confidence bound of the distribution (Srinivas et al., 2012).
\item \code{bayes_upper_confidence_bound}: This acquisition function is based on the
upper confidence bound of the distribution (Kaufmann et al., 2012).
}}
    \item{\code{exploration_method}}{(\emph{optional}) Method used to steer exploration in
post-initialisation intensive searching steps. As stated earlier, each SMBO
iteration step compares suggested alternative parameter sets with an
incumbent \strong{best} set in a series of steps. The exploration method
controls how the set of alternative parameter sets is pruned after each
step in an iteration. Can be one of the following:
\itemize{
\item \code{single_shot} (default): The set of alternative parameter sets is not
pruned, and each intensification iteration contains only a single
intensification step that only uses a single bootstrap. This is the fastest
exploration method, but only superficially tests each parameter set.
\item \code{successive_halving}: The set of alternative parameter sets is
pruned by removing the worst performing half of the sets after each step
(Jamieson and Talwalkar, 2016).
\item \code{stochastic_reject}: The set of alternative parameter sets is pruned by
comparing the performance of each parameter set with that of the incumbent
\strong{best} parameter set using a paired Wilcoxon test based on shared
bootstraps. Parameter sets that perform significantly worse, at an alpha
level indicated by \code{smbo_stochastic_reject_p_value}, are pruned.
\item \code{none}: The set of alternative parameter sets is not pruned.
}}
    \item{\code{smbo_stochastic_reject_p_value}}{(\emph{optional}) The p-value threshold used
for the \code{stochastic_reject} exploration method.

The default value is \code{0.05}.}
    \item{\code{parallel_hyperparameter_optimisation}}{(\emph{optional}) Enable parallel
processing for hyperparameter optimisation. Defaults to \code{TRUE}. When set to
\code{FALSE}, this will disable the use of parallel processing while performing
optimisation, regardless of the settings of the \code{parallel} parameter. The
parameter moreover specifies whether parallelisation takes place within the
optimisation algorithm (\code{inner}, default), or in an outer loop (\code{outer})
over learners, data subsamples, etc.

\code{parallel_hyperparameter_optimisation} is ignored if \code{parallel=FALSE}.}
  }}
}
\value{
One or more familiarModel objects.
}
\description{
Train models using familiar. Evaluation is not performed.
}
\details{
This is a thin wrapper around \code{summon_familiar}, and functions like
it, but automatically skips all evaluation steps. Only a single learner is
allowed.
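
As a minimal sketch of how the arguments documented above might be combined,
the call below trains a single model on the built-in \code{iris} data set. It is
not taken from the package examples. The values chosen for \code{learner}
(\code{"glm"}) and \code{fs_method} (\code{"mrmr"}), as well as the numeric settings, are
illustrative assumptions; the vignettes on learners and feature selection
methods list the names that are actually valid. \code{outcome_type},
\code{outcome_column} and \code{parallel} are assumed to be passed through \code{...}
in the same way as the other configuration parameters.

\preformatted{
library(familiar)

# Illustrative settings only; check the vignettes for valid method names.
model <- train_familiar(
  data = iris,
  outcome_type = "multinomial",        # assumed outcome type for Species
  outcome_column = "Species",
  experimental_design = "fs+mb",       # feature selection + model building
  cluster_method = "hclust",
  cluster_similarity_metric = "spearman",
  cluster_similarity_threshold = 0.90, # documented default for spearman
  fs_method = "mrmr",                  # assumed method name
  vimp_aggregation_method = "borda",
  learner = "glm",                     # assumed learner name
  optimisation_bootstraps = 20,
  smbo_time_limit = 5,                 # minutes
  parallel = FALSE
)
}

Because evaluation is skipped, the returned object only contains the trained
model or models. Any evaluation has to be run separately.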
}
