% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/ParseSettings.R
\name{.parse_initial_settings}
\alias{.parse_initial_settings}
\title{Internal function for parsing settings required to parse the input data
and define the experiment}
\usage{
.parse_initial_settings(config = NULL, ...)
}
\arguments{
\item{config}{A list of settings, e.g. from an XML file.}

\item{...}{
  Arguments passed on to \code{\link[=.parse_experiment_settings]{.parse_experiment_settings}}
  \describe{
    \item{\code{batch_id_column}}{(\strong{recommended}) Name of the column containing batch
or cohort identifiers. This parameter is required if more than one dataset
is provided, or if external validation is performed.

In \code{familiar}, every row of data is organised by four identifiers:
\itemize{
\item The batch identifier \code{batch_id_column}: This denotes the group to which a
set of samples belongs, e.g. patients from a single study, samples measured
in a batch, etc. The batch identifier is used for batch normalisation, as
well as selection of development and validation datasets.
\item The sample identifier \code{sample_id_column}: This denotes the sample level,
e.g. data from a single individual. Subsets of data, e.g. bootstraps or
cross-validation folds, are created at this level.
\item The series identifier \code{series_id_column}: Indicates measurements on a
single sample that may not share the same outcome value, e.g. a time
series, or the number of cells in a view.
\item The repetition identifier: Indicates repeated measurements in a single
series where any feature values may differ, but the outcome does not.
Repetition identifiers are always implicitly set when multiple entries for
the same series of the same sample in the same batch that share the same
outcome are encountered.
}}
    \item{\code{sample_id_column}}{(\strong{recommended}) Name of the column containing
sample or subject identifiers. See \code{batch_id_column} above for more
details.

If unset, every row will be identified as a single sample.}
    \item{\code{series_id_column}}{(\strong{optional}) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See \code{batch_id_column} above for more details.
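
As a rough illustration of the identifier hierarchy, the sketch below builds a
small table and numbers the repeated measurements within each series. The
column names, the data and the \code{repetition_id} variable are hypothetical.
\code{familiar} sets repetition identifiers internally, and this is not its
implementation.

```r
# Hypothetical data: one batch with a patient measured in two series,
# and a second batch with a single patient.
data <- data.frame(
  batch_id = c("study_A", "study_A", "study_A", "study_B"),
  sample_id = c("patient_1", "patient_1", "patient_1", "patient_2"),
  series_id = c(1, 1, 2, 1),
  outcome = c(0, 0, 1, 1)
)

# Rows that share batch, sample and series identifiers are treated here as
# repeated measurements, and are numbered within their group.
repetition_id <- ave(
  seq_len(nrow(data)),
  data$batch_id, data$sample_id, data$series_id,
  FUN = seq_along
)
repetition_id
```

In this sketch the first two rows form two repetitions of the same series,
whereas the third row starts a new series because its outcome differs.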

If unset, rows which share the same batch and sample identifiers but have a
different outcome are assigned unique series identifiers.}
    \item{\code{development_batch_id}}{(\emph{optional}) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all
identifiers, or to all identifiers except those in \code{validation_batch_id} if
external validation is performed.
Required if external validation is performed and \code{validation_batch_id} is
not provided.}
    \item{\code{validation_batch_id}}{(\emph{optional}) One or more batch or cohort
identifiers to constitute data sets for external validation. If external
validation is performed, defaults to all data sets except those in
\code{development_batch_id}; otherwise defaults to none. Required if
\code{development_batch_id} is not provided.}
    \item{\code{outcome_name}}{(\emph{optional}) Name of the modelled outcome. This name will
be used in figures created by \code{familiar}.

If not set, the column name in \code{outcome_column} will be used for
\code{binomial}, \code{multinomial}, \code{count} and \code{continuous} outcomes. For other
outcomes (\code{survival} and \code{competing_risk}) no default is used.}
    \item{\code{outcome_column}}{(\strong{recommended}) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that \code{survival}
and \code{competing_risk} outcomes require two columns that
indicate the time-to-event or the time of last follow-up and the event
status.}
    \item{\code{outcome_type}}{(\strong{recommended}) Type of outcome found in the outcome
column. The outcome type determines many aspects of the overall process,
e.g. the available feature selection methods and learners, but also the
type of assessments that can be conducted to evaluate the resulting models.
Implemented outcome types are:
\itemize{
\item \code{binomial}: categorical outcome with 2 levels.
\item \code{multinomial}: categorical outcome with 2 or more levels.
\item \code{count}: Poisson-distributed numeric outcomes.
\item \code{continuous}: general continuous numeric outcomes.
\item \code{survival}: survival outcome for time-to-event data.
}

If not provided, the algorithm will attempt to infer \code{outcome_type} from
the contents of the outcome column. This may lead to unexpected results, and
we therefore advise providing this information manually.
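
The short sketch below hints at why inferring the outcome type can be
ambiguous. It only uses base R on a hypothetical outcome vector, and does not
reproduce the inference rules used by \code{familiar}.

```r
# Hypothetical outcome column encoded as 0 and 1.
outcome <- c(0, 1, 1, 0)

# The values are numeric, so they could be read as count or continuous.
looks_numeric <- is.numeric(outcome)

# There are exactly two distinct values, so they could also be read as binomial.
looks_binary <- length(unique(outcome)) == 2

c(numeric = looks_numeric, binary = looks_binary)
```

Both checks succeed for the same column, which is why setting
\code{outcome_type} explicitly is the safer choice.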

Note that \code{competing_risk} survival analyses are not fully supported, and
\code{competing_risk} is currently not a valid choice for \code{outcome_type}.}
    \item{\code{class_levels}}{(\emph{optional}) Class levels for \code{binomial} or \code{multinomial}
outcomes. This argument can be used to specify the ordering of levels for
categorical outcomes. These class levels must exactly match the levels
present in the outcome column.}
    \item{\code{event_indicator}}{(\strong{recommended}) Indicator for events in \code{survival}
and \code{competing_risk} analyses. \code{familiar} will automatically recognise \code{1},
\code{true}, \code{t}, \code{y} and \code{yes} as event indicators, including different
capitalisations. If this parameter is set, it replaces the default values.}
    \item{\code{censoring_indicator}}{(\strong{recommended}) Indicator for right-censoring in
\code{survival} and \code{competing_risk} analyses. \code{familiar} will automatically
recognise \code{0}, \code{false}, \code{f}, \code{n} and \code{no} as censoring indicators, including
different capitalisations. If this parameter is set, it replaces the
default values.}
    \item{\code{competing_risk_indicator}}{(\strong{recommended}) Indicator for competing
risks in \code{competing_risk} analyses. There are no default values, and if
unset, all values other than those specified by the \code{event_indicator} and
\code{censoring_indicator} parameters are considered to indicate competing
risks.}
    \item{\code{signature}}{(\emph{optional}) One or more names of feature columns that are
considered part of a specific signature. Features specified here will
always be used for modelling. Ranking from feature selection has no effect
for these features.}
    \item{\code{novelty_features}}{(\emph{optional}) One or more names of feature columns
that should be included for the purpose of novelty detection.}
    \item{\code{exclude_features}}{(\emph{optional}) Feature columns that will be removed
from the data set. Cannot overlap with features in \code{signature},
\code{novelty_features} or \code{include_features}.}
    \item{\code{include_features}}{(\emph{optional}) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with \code{exclude_features}, but may overlap \code{signature}. Features in
\code{signature} and \code{novelty_features} are always included. If both
\code{exclude_features} and \code{include_features} are provided, \code{include_features}
takes precedence, provided that there is no overlap between the two.}
    \item{\code{experimental_design}}{(\strong{required}) Defines what the experiment looks
like, e.g. \code{cv(bt(fs,20)+mb,3,2)+ev} for 2 times repeated 3-fold
cross-validation with nested feature selection on 20 bootstraps and
model-building, and external validation. The basic workflow components are:
\itemize{
\item \code{fs}: (required) feature selection step.
\item \code{mb}: (required) model building step.
\item \code{ev}: (optional) external validation. Note that internal validation due
to subsampling will always be conducted if the subsampling methods create
any validation data sets.
}

The different components are linked using \code{+}.

Different subsampling methods can be used in conjunction with the basic
workflow components:
\itemize{
\item \code{bs(x,n)}: (stratified) .632 bootstrap, with \code{n} the number of
bootstraps. In contrast to \code{bt}, feature pre-processing parameters are
determined and hyperparameters are optimised on individual bootstraps.
\item \code{bt(x,n)}: (stratified) .632 bootstrap, with \code{n} the number of
bootstraps. Unlike \code{bs} and other subsampling methods, no separate
pre-processing parameters or optimised hyperparameters will be determined
for each bootstrap.
\item \code{cv(x,n,p)}: (stratified) \code{n}-fold cross-validation, repeated \code{p} times.
Pre-processing parameters are determined for each iteration.
\item \code{lv(x)}: leave-one-out cross-validation. Pre-processing parameters are
determined for each iteration.
\item \code{ip(x)}: imbalance partitioning for addressing class imbalances in the
data set. Pre-processing parameters are determined for each partition. The
number of partitions generated depends on the imbalance correction method
(see the \code{imbalance_correction_method} parameter). Imbalance partitioning
does not generate validation sets.
}

As shown in the \code{cv(bt(fs,20)+mb,3,2)+ev} example above, subsampling methods
can be nested.
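
To make the nesting concrete, the arithmetic below counts the iterations
implied by \code{cv(bt(fs,20)+mb,3,2)+ev}. This is a sketch that assumes each
cross-validation iteration runs its own set of bootstraps for feature
selection, as the nesting suggests. The variable names are illustrative only.

```r
# Values read from the design string cv(bt(fs,20)+mb,3,2)+ev.
n_folds <- 3L        # 3-fold cross-validation
n_repetitions <- 2L  # repeated 2 times
n_bootstraps <- 20L  # 20 bootstraps for feature selection

# Each fold of each repetition yields one training and validation split.
n_cv_iterations <- n_folds * n_repetitions

# Feature selection is nested in the bootstraps of every split.
n_feature_selection_runs <- n_cv_iterations * n_bootstraps

c(cv_iterations = n_cv_iterations, fs_runs = n_feature_selection_runs)
```

Under these assumptions there are 6 cross-validation iterations and 120
feature selection runs, before accounting for multiple feature selection
methods or learners.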

The simplest valid experimental design is \code{fs+mb}, which corresponds to a
TRIPOD type 1a analysis. Type 1b analyses are only possible using
bootstraps, e.g. \code{bt(fs+mb,100)}. Type 2a analyses can be conducted using
cross-validation, e.g. \code{cv(bt(fs,100)+mb,10,1)}. Depending on the origin of
the external validation data, designs such as \code{fs+mb+ev} or
\code{cv(bt(fs,100)+mb,10,1)+ev} constitute type 2b or type 3 analyses. Type 4
analyses can be done by obtaining one or more \code{familiarModel} objects from
others and applying them to your own data set.

Alternatively, the \code{experimental_design} parameter may be used to provide a
path to a file containing iterations, which is named \verb{####_iterations.RDS}
by convention. This path can be relative to the directory of the current
experiment (\code{experiment_dir}), or an absolute path. The absolute path may
thus also point to a file from a different experiment.}
    \item{\code{imbalance_correction_method}}{(\emph{optional}) Type of method used to
address class imbalances. Available options are:
\itemize{
\item \code{full_undersampling} (default): All data will be used in an ensemble
fashion. The full minority class will appear in each partition, but
majority classes are undersampled until all data have been used.
\item \code{random_undersampling}: Randomly undersamples majority classes. This is
useful in cases where full undersampling would lead to the formation of
many models due to major overrepresentation of the largest class.
}
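
The idea behind full undersampling can be sketched as follows. This is an
assumption-laden illustration using hypothetical sample indices, and is not
necessarily how \code{familiar} forms its partitions: the majority class is cut
into chunks of at most the minority class size, and every chunk is combined
with the full minority class.

```r
# Hypothetical sample indices: 10 minority and 35 majority class samples.
minority <- 1:10
majority <- 11:45

# Assign majority samples to consecutive chunks of at most 10 samples.
chunk <- ceiling(seq_along(majority) / length(minority))

# Every partition contains the full minority class plus one majority chunk.
partitions <- lapply(
  split(majority, chunk),
  function(idx) c(minority, idx)
)

length(partitions)
```

In this sketch the 35 majority samples lead to four partitions, which shows
how a strongly overrepresented class can lead to many partitions, and thus
many models.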

This parameter is only used in combination with imbalance partitioning in
the experimental design, and \code{ip} should therefore appear in the string
that defines the design.}
    \item{\code{imbalance_n_partitions}}{(\emph{optional}) Number of times random
undersampling should be repeated. 10 undersampled subsets with balanced
classes are formed by default.}
  }}
}
\value{
A list of settings to be used for configuring the experiments.
}
\description{
This function parses settings required to parse the data set, e.g. to determine
which columns are identifier columns, which column contains outcome data, and
what the type of outcome is.
}
\details{
Three variants of parameters exist:
\itemize{
\item required: this parameter is required and must be set by the user.
\item recommended: not setting this parameter might cause an error to be thrown,
depending on other input.
\item optional: these parameters have default values that may be altered if
required.
}
}
\keyword{internal}
