% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/fit.R
\name{fit.infer}
\alias{fit.infer}
\title{Fit linear models to infer objects}
\usage{
\method{fit}{infer}(object, ...)
}
\arguments{
\item{object}{Output from an infer function---likely \code{\link[=generate]{generate()}} or
\code{\link[=specify]{specify()}}---which specifies the formula and data to fit a model to.}

\item{...}{Any optional arguments to pass along to the model fitting
function. See \code{\link[stats:glm]{stats::glm()}} for more information.}
}
\value{
A \link[tibble:tibble]{tibble} containing the following columns:

\itemize{
\item \code{replicate}: Only supplied if the input object had been previously
passed to \code{\link[=generate]{generate()}}. A number corresponding to which resample of the
original data set the model was fitted to.
\item \code{term}: The explanatory variable (or intercept) in question.
\item \code{estimate}: The model coefficient for the given resample (\code{replicate}) and
explanatory variable (\code{term}).
}
}
\description{
Given the output of an infer core function, this function will fit
a generalized linear model using \code{\link[stats:glm]{stats::glm()}} according to the formula and data supplied
earlier in the pipeline. If passed the output of \code{\link[=specify]{specify()}} or
\code{\link[=hypothesize]{hypothesize()}}, the function will fit one model. If passed the output
of \code{\link[=generate]{generate()}}, it will fit a model to each data resample, denoted in
the \code{replicate} column. The family of the fitted model depends on the type
of the response variable. If the response is numeric, \code{fit()} will use
\code{family = "gaussian"} (linear regression). If the response is a 2-level
factor or character, \code{fit()} will use \code{family = "binomial"} (logistic
regression). To fit character or factor response variables with more than
two levels, we recommend \code{\link[parsnip:multinom_reg]{parsnip::multinom_reg()}}.

infer provides a fit "method" for infer objects, which is a way of carrying
out model fitting as applied to infer output. The "generic," imported from
the generics package and re-exported from this package, provides the
general form of \code{fit()} that points to infer's method when called on an
infer object. That generic is also documented here.

Learn more in \code{vignette("infer")}.
}
\details{
Randomization-based statistical inference with multiple explanatory
variables requires careful consideration of the null hypothesis in question
and its implications for permutation procedures. Inference for partial
regression coefficients via the permutation method implemented in
\code{\link[=generate]{generate()}} for multiple explanatory variables, consistent with its meaning
elsewhere in the package, is subject to additional distributional assumptions
beyond those required for one explanatory variable. Namely, the distribution
of the response variable must be similar to the distribution of the errors
under the null hypothesis' specification of a fixed effect of the explanatory
variables. (This null hypothesis is reflected in the \code{variables} argument to
\code{\link[=generate]{generate()}}. By default, all of the explanatory variables are treated
as fixed.) As a rule of thumb, if any explanatory variable has large
outliers, this distributional assumption will not be satisfied: when the
response variable is permuted, the (presumably outlying) value of the
response will no longer be paired with the outlier in the explanatory
variable, causing an outsize effect on the resulting slope coefficient
for that explanatory variable.

More sophisticated methods that require fewer, or less strict,
distributional assumptions exist but are outside the scope of this
package. For an overview, see "Permutation tests for univariate or
multivariate analysis of variance and regression" (Marti J. Anderson,
2001), \doi{10.1139/cjfas-58-3-626}.
}
\section{Reproducibility}{
When using the infer package for research, or in other cases when exact
reproducibility is a priority, be sure to set the seed for R’s random
number generator. infer will respect the random seed specified in the
\code{set.seed()} function, returning the same result when generating data
with \code{generate()} given an identical seed. For instance, we can calculate the
difference in mean \code{age} by \code{college} degree status using the \code{gss}
dataset from 5 versions of the \code{gss} resampled with permutation using
the following code.\if{html}{\out{<div class="r">}}\preformatted{set.seed(1)

gss \%>\%
  specify(age ~ college) \%>\%
  hypothesize(null = "independence") \%>\%
  generate(reps = 5, type = "permute") \%>\%
  calculate("diff in means", order = c("degree", "no degree"))
}\if{html}{\out{</div>}}\preformatted{## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 × 2
##   replicate   stat
##       <int>  <dbl>
## 1         1 -0.531
## 2         2 -2.35 
## 3         3  0.764
## 4         4  0.280
## 5         5  0.350
}

Setting the seed to the same value again and rerunning the same code
will produce the same result.\if{html}{\out{<div class="r">}}\preformatted{# set the seed
set.seed(1)

gss \%>\%
  specify(age ~ college) \%>\%
  hypothesize(null = "independence") \%>\%
  generate(reps = 5, type = "permute") \%>\%
  calculate("diff in means", order = c("degree", "no degree"))
}\if{html}{\out{</div>}}\preformatted{## Response: age (numeric)
## Explanatory: college (factor)
## Null Hypothesis: independence
## # A tibble: 5 × 2
##   replicate   stat
##       <int>  <dbl>
## 1         1 -0.531
## 2         2 -2.35 
## 3         3  0.764
## 4         4  0.280
## 5         5  0.350
}
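
As a minimal sketch of how this could be checked programmatically, the
statistics from two seeded runs can be stored and compared with
\code{identical()}. The object names \code{first_run} and \code{second_run} are
illustrative only and are not part of infer.\if{html}{\out{<div class="r">}}\preformatted{# first run, seeded
set.seed(1)

first_run <- gss \%>\%
  specify(age ~ college) \%>\%
  hypothesize(null = "independence") \%>\%
  generate(reps = 5, type = "permute") \%>\%
  calculate("diff in means", order = c("degree", "no degree"))

# second run, reseeded with the same value
set.seed(1)

second_run <- gss \%>\%
  specify(age ~ college) \%>\%
  hypothesize(null = "independence") \%>\%
  generate(reps = 5, type = "permute") \%>\%
  calculate("diff in means", order = c("degree", "no degree"))

# compare the resampled statistics from the two runs
identical(first_run$stat, second_run$stat)
}\if{html}{\out{</div>}}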

Please keep this in mind when writing infer code that uses
resampling with \code{generate()}.
}

\examples{
# fit a linear model predicting number of hours worked per
# week using respondent age and degree status.
observed_fit <- gss \%>\%
  specify(hours ~ age + college) \%>\%
  fit()

observed_fit

# fit 100 models to resamples of the gss dataset, where the response 
# `hours` is permuted in each. note that this code is the same as 
# the above except for the addition of the `generate` step.
null_fits <- gss \%>\%
  specify(hours ~ age + college) \%>\%
  hypothesize(null = "independence") \%>\%
  generate(reps = 100, type = "permute") \%>\%
  fit()

null_fits
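
# a hedged sketch, assuming an infer version whose `generate()` has the
# `variables` argument described in Details: permute the explanatory
# variables `age` and `college` rather than the response. by default only
# the response is permuted, so the explanatory variables are treated as
# fixed. the name `null_fits_explanatory` is illustrative only.
null_fits_explanatory <- gss \%>\%
  specify(hours ~ age + college) \%>\%
  hypothesize(null = "independence") \%>\%
  generate(reps = 100, type = "permute", variables = c(age, college)) \%>\%
  fit()

null_fits_explanatory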

# for logistic regression, just supply a binary response variable!
# (this can also be made explicit via the `family` argument in ...)
gss \%>\%
  specify(college ~ age + hours) \%>\%
  fit()
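
# the same logistic regression with the family stated explicitly. this is
# a sketch that relies on `...` being passed along to stats::glm(), as
# described in the arguments above.
gss \%>\%
  specify(college ~ age + hours) \%>\%
  fit(family = "binomial")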

# more in-depth explanation of how to use the infer package
\dontrun{
vignette("infer")
}  

}
