% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/simBulk.R
\name{generateBulkCellMatrix}
\alias{generateBulkCellMatrix}
\title{Generate training and test cell composition matrices}
\usage{
generateBulkCellMatrix(
  object,
  cell.ID.column,
  cell.type.column,
  prob.design,
  num.bulk.samples,
  n.cells = 100,
  train.freq.cells = 2/3,
  train.freq.bulk = 2/3,
  proportions.train = c(10, 5, 20, 15, 35, 15),
  proportions.test = c(10, 5, 20, 15, 35, 15),
  prob.zero = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5),
  balanced.type.cells = FALSE,
  verbose = TRUE
)
}
\arguments{
\item{object}{\code{\linkS4class{DigitalDLSorter}} object with a
\code{single.cell.real} slot and, optionally, a \code{single.cell.simul}
slot.}

\item{cell.ID.column}{Name or number of the column in the cells metadata
that contains the cell names of the expression matrix.}

\item{cell.type.column}{Name or number of the column in the cells metadata
that contains the cell type of each cell.}

\item{prob.design}{Data frame with the expected frequency ranges for each
cell type present in the experiment. This information can be estimated from
the literature or from the single-cell experiment itself. This data frame
must consist of three columns with specific headings (see examples):
\itemize{ \item A cell type column with the same name as the cell type
column in the cells metadata (\code{cell.type.column}). If the column names
differ, the function will return an error. All cell types must appear in the
cells metadata. \item A second column called \code{'from'} with the start
frequency for each cell type. \item A third column called \code{'to'} with
the ending frequency for each cell type.}}

\item{num.bulk.samples}{Number of bulk RNA-Seq sample proportions (and thus
simulated bulk RNA-Seq samples) to be generated, counting both training and
test data. We recommend setting this value according to the number of
single-cell profiles available in the \code{\linkS4class{DigitalDLSorter}}
object: large enough for better training, but without excessive re-sampling
of the same cells.}

\item{n.cells}{Number of cells that will be aggregated in order to simulate
one bulk RNA-Seq sample (100 by default).}

\item{train.freq.cells}{Proportion of cells used to simulate training
pseudo-bulk samples (2/3 by default).}

\item{train.freq.bulk}{Proportion of the total number of bulk RNA-Seq samples
(\code{num.bulk.samples}) used for the training set (2/3 by default).}

\item{proportions.train}{Vector of six integers that determines the
proportions of bulk samples generated by the different methods (see Details
and Torroja and Sánchez-Cabo, 2019, for more information). This vector
represents proportions, so its entries must add up to 100. By default, a
majority of random samples will be generated without using predefined
ranges.}

\item{proportions.test}{Same as \code{proportions.train}, but for test
samples.}

\item{prob.zero}{Probability of producing cell type proportions equal to
zero. It is a vector of six elements corresponding to the six methods of
producing cell type proportions (see \code{proportions.train} for more
details).}

\item{balanced.type.cells}{Boolean indicating whether the training and test
cells will be split in a balanced way considering the cell types
(\code{FALSE} by default).}

\item{verbose}{Show informative messages during the execution (\code{TRUE} by
default).}
}
\value{
A \code{\linkS4class{DigitalDLSorter}} object with the
\code{prob.cell.types} slot containing a \code{list} with two
\code{\linkS4class{ProbMatrixCellTypes}} objects (training and test). For
more information about the structure of this class, see
\code{?\linkS4class{ProbMatrixCellTypes}}.
}
\description{
Generate training and test cell composition matrices for the simulation of
pseudo-bulk RNA-Seq samples with known cell composition using single-cell
expression profiles. The resulting \code{\linkS4class{ProbMatrixCellTypes}}
object contains a matrix that determines the proportion of the different cell
types that will compose the simulated pseudo-bulk samples, together with
other information relevant for the process. This function does not simulate
pseudo-bulk samples; this task is performed by the
\code{\link{simBulkProfiles}} or \code{\link{trainDigitalDLSorterModel}}
functions (see Documentation).
}
\details{
First, the available single-cell profiles are split into training and test
subsets (2/3 for training and 1/3 for test by default, see
\code{train.freq.cells}) to avoid biasing the results during model
evaluation. Next, \code{num.bulk.samples} bulk sample proportions are built
and the single-cell profiles used to simulate each pseudo-bulk RNA-Seq
sample are set, with 100 cells per bulk sample by default (see the
\code{n.cells} argument). The proportions of training and test pseudo-bulk
samples are set by \code{train.freq.bulk} (2/3 for training and 1/3 for
testing by default). Finally, to avoid biases due to the composition of the
pseudo-bulk RNA-Seq samples, cell type proportions (\eqn{w_1,...,w_k},
where \eqn{k} is the number of cell types available in the single-cell
profiles) are randomly generated using six different approaches:

\enumerate{
\item Cell proportions are randomly sampled from a truncated uniform
distribution with predefined limits according to a priori knowledge of the
abundance of each cell type (see the \code{prob.design} argument). This
information can be inferred from the single-cell experiment itself or from
the literature.
\item A second set is generated by randomly permuting cell type labels from
a distribution generated by the previous method.
\item Cell proportions are randomly sampled as in method 1, but without
replacement.
\item Using the proportions generated by the previous method, cell type
labels are randomly sampled.
\item Cell proportions are randomly sampled from a Dirichlet distribution.
\item Pseudo-bulk RNA-Seq samples composed of a single cell type are
generated in order to provide 'pure' pseudo-bulk samples.}

The distribution of cell type proportions generated by each method can be
inspected with the \code{\link{showProbPlot}} function (see Documentation).
}
\examples{
set.seed(123) # reproducibility
# simulated data
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(30, lambda = 5), nrow = 15, ncol = 10, 
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10, 
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
DDLS <- loadSCProfiles(
  single.cell.data = sce,
  cell.ID.column = "Cell_ID",
  gene.ID.column = "Gene_ID"
)
probMatrixValid <- data.frame(
  Cell_Type = paste0("CellType", seq(2)),
  from = c(1, 30),
  to = c(15, 70)
)
DDLS <- generateBulkCellMatrix(
  object = DDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  prob.design = probMatrixValid,
  num.bulk.samples = 10,
  verbose = TRUE
)
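
# Illustrative sketch (base R only, not package code): a custom vector for
# proportions.train or proportions.test needs six entries, one per method
# described in Details, adding up to 100. The values below are hypothetical.
custom.proportions <- c(20, 10, 20, 10, 30, 10)
stopifnot(
  length(custom.proportions) == 6,
  sum(custom.proportions) == 100
)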

}
\references{
Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep
Learning algorithm to quantify immune cell populations based on scRNA-Seq
data. Frontiers in Genetics 10, 978. doi: \doi{10.3389/fgene.2019.00978}
}
\seealso{
\code{\link{simBulkProfiles}}
\code{\linkS4class{ProbMatrixCellTypes}}
}
