% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/mlim.R
\name{mlim}
\alias{mlim}
\title{missing data imputation with automated machine learning}
\usage{
mlim(
  data = NULL,
  m = 1,
  algos = c("RF", "ELNET", "GBM"),
  preimputed.data = NULL,
  ignore = NULL,
  tuning_time = 180,
  max_models = NULL,
  maxiter = 10L,
  miniter = 2L,
  cv = 10L,
  matching = "AUTO",
  balance = NULL,
  weights_column = NULL,
  seed = NULL,
  verbosity = NULL,
  report = NULL,
  tolerance = 0,
  doublecheck = TRUE,
  cpu = -1,
  ram = NULL,
  flush = FALSE,
  init = TRUE,
  shutdown = TRUE,
  sleep = 0.5,
  save = NULL,
  load = NULL,
  force.load = TRUE,
  ...
)
}
\arguments{
\item{data}{a \code{data.frame} or \code{matrix} with missing data to be
imputed. if \code{load} is provided, this argument will be ignored.}

\item{m}{integer, specifying number of multiple imputations. the default value is
1, carrying out a single imputation.}

\item{algos}{character vector, specifying algorithms to be used for missing data
             imputation. the default is \code{c("RF", "ELNET", "GBM")}, which uses
             Random Forest for a fast initial imputation, then uses ELNET to
             improve the imputation and, once ELNET stops improving, attempts
             "GBM", as long as the \code{maxiter} limit is not reached. in other words,
             "mlim" carries out 3 rounds of imputation, which are 1) preimputation with "RF",
             2) imputation with "ELNET", and 3) postimputation with "GBM". the reason for
             this setup is that in general, "RF" is faster than a fine-tuned "ELNET" and
             "ELNET" fine-tunes much faster than "GBM".

             in addition to these algorithms, \code{"DL"} (Deep Learning) and \code{"XGB"}
             (Extreme Gradient Boosting, only available in Mac OS and Linux) are also
             supported. "GBM", "DL", "XGB", and "Ensemble" take the full given
             \code{tuning_time} (see below) to tune the best model for imputing
             the given variable.}

\item{preimputed.data}{data.frame. if you have used other software for missing
data imputation, you can still optimize the imputation
by handing the imputed data.frame to this argument, which will
bypass the "preimpute" procedure.}

\item{ignore}{character vector of column names or vector of column indices that
should be ignored in the process of imputation.}

\item{tuning_time}{integer. maximum runtime (in seconds) for fine-tuning the
imputation model for each variable in each iteration. the default
time is 180 seconds but for a large dataset, you
might need to provide a larger model development
time. this argument also influences \code{max_models},
see below.}

\item{max_models}{integer. maximum number of models that can be generated in
the process of fine-tuning the parameters. this value
defaults to 100, meaning that for imputing each variable in
each iteration, up to 100 models can be fine-tuned. increasing
this value should be accompanied by increasing
\code{tuning_time}, allowing the model to spend
more time in the process of individualized fine-tuning.
as a result, the better tuned the model, the more accurate
the imputed values are expected to be.}

\item{maxiter}{integer. maximum number of iterations. the default value is \code{10},
but it can be reduced to \code{3} (not recommended, see below).}

\item{miniter}{integer. minimum number of iterations. the default value is
2.}

\item{cv}{integer. specify the number of folds for k-fold Cross-Validation (CV).
values of 10 or higher are recommended. default is 10.}

\item{matching}{logical or "AUTO". if \code{TRUE}, imputed values are coerced to the
closest of the non-missing values of the variable.
if set to "AUTO", 'mlim' decides whether to match
or not, based on the variable classes. the default is "AUTO".}

\item{balance}{character vector, specifying variable names that should be
balanced before imputation. balancing the prevalence might
decrease the overall accuracy of the imputation, because it
attempts to ensure the representation of the rare outcome.
this argument is optional and intended for advanced users who
impute a severely imbalanced categorical (nominal) variable.}

\item{weights_column}{non-negative integer. a vector of observation weights
can be provided, which should have one weight for each
row of the dataframe. giving an observation a weight of
zero is equivalent to ignoring that observation in the
model. in contrast, a weight of 2 is equivalent to
repeating that observation twice in the dataframe.
the higher the weight, the more important an observation
becomes in the modeling process. the default is NULL.}

\item{seed}{integer. specify the random generator seed.}

\item{verbosity}{character. controls how much information is printed to console.
the value can be "warn" (default), "info", "debug", or NULL.}

\item{report}{filename. if a filename is specified (e.g. report = "mlim.md"), the \code{"md.log"} R
package is used to generate a Markdown progress report for the
imputation. the format of the report is adapted based on the
\code{'verbosity'} argument. the higher the verbosity, the more
technical the report becomes. if verbosity equals "debug", then
a log file is generated, which includes time stamp and shows
the function that has generated the message. otherwise, a
reduced markdown-like report is generated. default is NULL.}

\item{tolerance}{numeric. the minimum rate of improvement in estimated error metric
of a variable to qualify the imputation for another round of iteration,
if the \code{maxiter} is not yet reached. any improvement of imputation
is desirable. however, specifying values above 0 can reduce the number
of required iterations at a marginal increase of imputation error.
for larger datasets, a value of \code{1e-3} is recommended. note that the
best accuracy is reached when this value is equal to zero.}

\item{doublecheck}{logical. default is TRUE (which is conservative). if FALSE and the estimated
imputation error of a variable does not improve, the variable
will not be reimputed in the following iterations. in general,
deactivating this argument will slightly reduce the imputation
accuracy; however, it significantly reduces the computation time.
if your dataset is large, you are advised to set this argument to
FALSE. (EXPERIMENTAL: consider that by avoiding several iterations
that marginally improve the imputation accuracy, you might gain
higher accuracy by investing your computational resources in fine-tuning
better algorithms such as "GBM")}

\item{cpu}{integer. number of CPUs to be dedicated for the imputation.
the default takes all of the available CPUs.}

\item{ram}{integer. specifies the maximum size, in Gigabytes, of the
memory allocation. by default, all the available memory is
used for the imputation.
large memory size is particularly advised, especially
for multicore processes. the more you give the more you get!}

\item{flush}{logical (experimental). if TRUE, after each model, the server is
cleaned to retrieve RAM. this feature is in testing mode.}

\item{init}{logical. should h2o Java server be initiated? the default is TRUE.
however, if the Java server is already running, set this argument
to FALSE.}

\item{shutdown}{logical. if TRUE, h2o server is closed after the imputation.
the default is TRUE.}

\item{sleep}{numeric. number of seconds to wait after each interaction with h2o
server. the default is 0.5 seconds. larger values might be needed
depending on your computation power or dataset size.}

\item{save}{(NOT YET IMPLEMENTED FOR R). filename. if a filename is specified, an \code{mlim} object is
saved after the end of each variable imputation. this object not only
includes the imputed dataframe and estimated cross-validation error, but also
includes the information needed for continuing the imputation,
which is a very useful feature for imputing large datasets with a
long runtime. this argument is activated by default and an
mlim object is stored in the local directory under the name \code{"mlim.rds"}.}

\item{load}{(NOT YET IMPLEMENTED FOR R). an object of class "mlim", which includes the data, arguments,
and settings for re-running the imputation, from where it was
previously stopped. the "mlim" object saves the current state of
the imputation and is particularly recommended for large datasets
or when the user specifies computationally intensive settings
(e.g. specifying several algorithms, increasing tuning time, etc.).}

\item{force.load}{(NOT YET IMPLEMENTED FOR R). logical (default is TRUE). if TRUE, when loading the mlim class
object, its preserved settings are used for restoring and saving the
following iterations. otherwise, if FALSE, the current arguments of
mlim are used to override the settings of the mlim object. the settings
include the full list of the mlim arguments.}

\item{...}{Arguments passed to \code{h2o.automl()}.
For example, the following arguments are incompatible with \code{ranger}: \code{write.forest}, \code{probability}, \code{split.select.weights}, \code{dependent.variable.name}, and \code{classification}.}
}
\value{
a \code{data.frame} of imputed data, with the
        estimated imputation error from the cross validation stored in the
        data.frame's attributes
}
\description{
imputes data.frame with mixed variable types using automated
             machine learning (AutoML)
}
\examples{

\donttest{
data(iris)
irisNA <- mlim.na(iris, p = 0.1, stratify = TRUE, seed = 2022)

# run the default imputation (fastest imputation via 'mlim')
MLIM <- mlim(irisNA)
mlim.error(MLIM, irisNA, iris)

# run GBM model and allow 15 minutes of tuning for each variable
MLIM <- mlim(irisNA, algos = "GBM", tuning_time = 60*15)
mlim.error(MLIM, irisNA, iris)
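
# a hedged sketch of trading a little accuracy for speed, using only the
# arguments documented above: restrict 'algos' to "RF" and "ELNET", stop
# iterating once improvement falls below 'tolerance', skip re-imputing
# variables that stopped improving ('doublecheck = FALSE'), and fix 'seed'.
# the specific values are illustrative assumptions, not recommendations
MLIM <- mlim(irisNA, algos = c("RF", "ELNET"), tolerance = 1e-3,
             doublecheck = FALSE, seed = 2022)
mlim.error(MLIM, irisNA, iris)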
}
}
\author{
E. F. Haghish
}
