% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/getLowLcpmCutoff.R
\name{getLowLcpmCutoff}
\alias{getLowLcpmCutoff}
\title{Function to empirically determine a log2 CPM cutoff based on ERCC RNA spike-in}
\usage{
getLowLcpmCutoff(
  obs,
  exp,
  pairs,
  n.bins = 7,
  rep = 1000,
  ci = 0.95,
  cor.value = 0.9,
  remove.outliers = TRUE,
  seed = 20220719
)
}
\arguments{
\item{obs}{A data frame of observed spike-in ERCC data.  Each row is an ERCC
transcript, and each column is a sample.  Values are log2 counts per
million (LCPM) normalized for read coverage.}

\item{exp}{A data frame of expected ERCC Mix 1 and Mix 2 ratios with a column
titled \code{expected_lfc_ratio} containing the expected log2 fold-change
ratios. These data can be obtained from the 'ERCC Controls Analysis' manual
on the Thermo Fisher ERCC RNA Spike-In Mix product page
(\url{https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095046.txt}).
The \code{exp_input} data frame mirrors the fields shown in the ERCC manual.
For the LCPM cutoff calculation, the last column, which contains the
expected log2 fold-change ratios, is used.  Ensure that this column is
titled \code{expected_lfc_ratio}. See the example code below for formatting
the data.}

\item{pairs}{A 2-column data frame where each row indicates a sample pair
with the first column indicating the sample that received ERCC spike-ins
from Mix 1 and the second column indicating the sample receiving Mix 2.}

\item{n.bins}{Integer.  The number of abundance bins to create.  Default is 7.}

\item{rep}{Integer.  The number of bootstrap replicates.  Default is 1000.}

\item{ci}{Numeric.  The confidence level of the bootstrap confidence
interval.  Default is 0.95.}

\item{cor.value}{Numeric.  The desired Spearman correlation between the
expected and empirical log2 fold changes across the ERCC transcripts.
Default is 0.9.}

\item{remove.outliers}{Logical.  If TRUE (default), outliers, identified as
points exceeding 1.5 IQR, are removed prior to fitting the polynomial.
Set to FALSE to keep all points.}

\item{seed}{Integer.  The reproducibility seed.  Default is 20220719.}
}
\value{
An "empLCPM" object is returned, which contains the following named
elements:
\tabular{ll}{
   \code{cutoff} \tab a vector containing 3 values: the threshold value,
   the upper confidence limit, \cr \tab and the lower confidence limit. \cr
   \code{args} \tab a key: value list of arguments that were provided. \cr
   \code{res} \tab a list containing the main results and other
   information from the input. \cr \tab The \code{\link{summary.empLCPM}}
   function should be used to extract a summary table. \cr
 }
}
\description{
This function uses spike-in ERCC data, known control RNA probes,
   and paired samples to fit a 3rd order polynomial to determine an expression
   cutoff that meets the specified correlation between expected and observed
   fold changes.  The \code{obs} data frame is used as input for the observed
   expression of the 92 ERCC RNA spike-ins and stores the coverage-normalized
   log2 counts per million (LCPM) of the reads that mapped to the respective
   ERCC sequences.  Typically, prior to LCPM calculation, the read count data
   are normalized for any systematic differences in read coverage between
   samples, for example, by using the TMM normalization method as implemented
   in the \code{edgeR} package.

   For each bootstrap replicate, the paired samples are subsampled with
   replacement.  The mean LCPM of each ERCC transcript is determined by
   first calculating the average LCPM value for each paired sample, and
   then taking the mean of those averages. The ERCC transcripts are sorted
   based on these means, and are then grouped into \code{n.bins} ERCC bins.
   Next, the Spearman correlation metric is used to calculate the association
   between the empirical and expected log fold change (LFC) of the ERCCs in
   each bin for each sample.
   Additionally, the average LCPM of the ERCCs in each bin is calculated
   for each sample. This leads to a pair of values - the average LCPM and the
   association value - for each sample and each ERCC bin.  If
   \code{remove.outliers} is TRUE, outliers within each ERCC bin are
   identified as exceeding 1.5 IQR and removed.
   A 3rd order polynomial is fit with the explanatory variable being the
   average LCPM and the response variable being the Spearman correlation
   value between expected and observed log2 fold changes.
   The fitted curve is used to identify the average LCPM value with a Spearman
   correlation of \code{cor.value}. The results are output as an "empLCPM"
   object as described below.  The \code{\link{summary.empLCPM}} function can
   be used to extract a summary of the results, and the
   \code{\link{plot.empLCPM}} function to plot the results for visualization.
}
\examples{
library(CpmERCCutoff)
##############################
# Load and wrangle input data:
##############################
# Load observed ERCC data (LCPM):
data("obs_input")

# Set ERCC IDs as row names.
rownames(obs_input) = obs_input$X
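
# Note: obs_input is already on the LCPM scale. When starting from raw read
#       counts, a sketch along the following lines could produce
#       TMM-normalized LCPM values with edgeR. The count matrix toy_counts
#       is simulated and all toy_* names are hypothetical; this block is an
#       illustration only and is not part of the package workflow.
if (requireNamespace("edgeR", quietly = TRUE)) {
  set.seed(1)
  toy_counts = matrix(rpois(92 * 4, lambda = 50), nrow = 92,
                      dimnames = list(paste0("feature", 1:92),
                                      paste0("sample", 1:4)))
  toy_dge = edgeR::DGEList(counts = toy_counts)
  # Estimate TMM scaling factors for the library sizes.
  toy_dge = edgeR::calcNormFactors(toy_dge, method = "TMM")
  # log = TRUE returns log2 counts per million based on the
  # TMM-adjusted library sizes.
  toy_lcpm_mat = edgeR::cpm(toy_dge, log = TRUE)
  head(toy_lcpm_mat)
}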

# Load expected ERCC data:
data("exp_input")

# Order rows by ERCC ID.
exp_input = exp_input[order(exp_input$ercc_id), ]
rownames(exp_input) = exp_input$ercc_id

# Load metadata:
data("mta_dta")

# Pair samples that received ERCC Mix 1 with samples that received ERCC Mix 2.
# The resulting 2-column data frame is used for the 'pairs' argument.
# Note: the code here will depend on the details of the given experiment. In
#       this example, the post-vaccination samples (which received Mix 2)
#       for each subject are paired to their pre-vaccination samples (which
#       received Mix 1).
pairs_input = cbind(
  mta_dta[mta_dta$spike == 2, 'samid'],
  mta_dta[match(mta_dta[mta_dta$spike == 2, 'subid'],
                mta_dta[mta_dta$spike == 1,'subid']), 'samid'])
# Put Mix 1 in the first column and Mix 2 in the second.
pairs_input = pairs_input[, c(2, 1)]

###############################
# Run getLowLcpmCutoff Function:
###############################
# Note: Here we use `rep = 10` for only 10 bootstrap replicates
#       to decrease the run time for this example; a larger number
#       should be used in practice (default = 1000).
res = getLowLcpmCutoff(obs = obs_input,
                       exp = exp_input,
                       pairs = pairs_input,
                       n.bins = 7,
                       rep = 10,
                       cor.value = 0.9,
                       remove.outliers = TRUE,
                       seed = 20220719)

# Print a short summary of the results:
res

# Extract a summary table of the results:
summary(res)
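
# The named elements listed in the Value section can also be inspected
# directly.
# Note: this assumes the returned "empLCPM" object is list-like (the usual
#       layout of an S3 object), which is an assumption of this sketch.
res[["cutoff"]]  # threshold value and its confidence limits
res[["args"]]    # arguments that were supplied to the call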

# Create a plot of the results:
plot(x = res,
     main = "Determination of Empirical Minimum Expression Cutoffs using ERCCs",
     col.trend = "blue",
     col.outlier = c("black", "red"))
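
# A minimal, self-contained sketch of the curve-fitting step outlined in the
# Description section, using simulated values rather than package data.
# Note: all toy_* names are hypothetical, and this is not the internal
#       implementation of getLowLcpmCutoff.
set.seed(1)
toy_lcpm = seq(-2, 12, length.out = 60)
# Simulated Spearman correlations that rise with abundance, plus noise.
toy_cor = 1 / (1 + exp(-0.6 * (toy_lcpm - 1))) + rnorm(60, sd = 0.02)
# 3rd order polynomial: correlation as a function of average LCPM.
toy_fit = lm(toy_cor ~ poly(toy_lcpm, 3, raw = TRUE))
# Evaluate the fitted curve on a fine grid and take the lowest LCPM at
# which the fitted correlation reaches the target value of 0.9.
toy_grid = seq(min(toy_lcpm), max(toy_lcpm), length.out = 1000)
toy_pred = predict(toy_fit, newdata = data.frame(toy_lcpm = toy_grid))
toy_cutoff = toy_grid[which(toy_pred >= 0.9)[1]]
toy_cutoff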

}
\seealso{
\code{\link{summary.empLCPM}}, \code{\link{plot.empLCPM}},
\code{\link{print.empLCPM}}
}
