% Generated by roxygen2 (4.0.2): do not edit by hand
\name{calculate_emd}
\alias{calculate_emd}
\title{Earth Mover's Distance for differential analysis of genomics data}
\usage{
calculate_emd(data, samplesA, samplesB, binSize = 0.2, nperm = 100,
  verbose = TRUE)
}
\arguments{
\item{data}{A matrix containing genomics data (e.g. gene expression levels).
The row names should contain gene identifiers, while the column names should
contain sample identifiers.}

\item{samplesA}{A vector of sample names identifying samples in \code{data}
that belong to "group A". The names must correspond to column names
in \code{data}.}

\item{samplesB}{A vector of sample names identifying samples in \code{data}
that belong to "group B". The names must correspond to column names
in \code{data}.}

\item{binSize}{The bin size to be used when generating histograms of
the data for "group A" and "group B". Defaults to 0.2.}

\item{nperm}{An integer specifying the number of randomly permuted EMD
scores to be computed. Defaults to 100.}

\item{verbose}{Boolean specifying whether to display progress messages.
Defaults to \code{TRUE}.}
}
\value{
The function returns an \code{\link{EMDomics}} object.
}
\description{
This is the main user interface to the \pkg{EMDomics} package, and
will usually be the only function needed.

The algorithm is used to compare genomics data between two groups, referred to
herein as "group A" and "group B". Usually the data will be gene expression
values from array-based or sequence-based experiments, but data from other
types of experiments can also be analyzed (e.g. copy number variation).

Traditional methods like Significance Analysis of Microarrays (SAM) and Linear
Models for Microarray Data (LIMMA) use significance tests based on summary
statistics (mean and standard deviation) of the two distributions. This
approach tends to give non-significant results if the two distributions are
highly heterogeneous, which can be the case in many biological circumstances
(e.g. sensitive vs. resistant tumor samples).

The Earth Mover's Distance algorithm instead computes the "work" needed
to transform one distribution into the other, thus capturing possibly
valuable information relating to the overall difference in shape between
two heterogeneous distributions.

The EMD-based algorithm implemented in \pkg{EMDomics} has two main steps.
First, a matrix (e.g. of expression data) is divided into data for "group A"
and "group B", and the EMD score is calculated using the two groups for each
gene in the data set. Next, the labels for group A and group B are randomly
permuted a specified number of times, and an EMD score for each permutation is
calculated. The median of the permuted scores for each gene is used as
the null distribution, and the False Discovery Rate (FDR) is computed for
a range of permissive to restrictive significance thresholds. The threshold
that minimizes the FDR is defined as the q-value, and is used to interpret
the significance of the EMD score analogously to a p-value (e.g. a q-value
< 0.05 is considered significant).

Note that q-values of 0 are adjusted to 1/(nperm+1). For this reason, the
\code{nperm} parameter should not be too low (the default of 100 is
reasonable).
}
\examples{
# 100 genes, 100 samples
dat <- matrix(rnorm(10000), nrow=100, ncol=100)
rownames(dat) <- paste("gene", 1:100, sep="")
colnames(dat) <- paste("sample", 1:100, sep="")

# "group A" = first 50, "group B" = second 50
groupA <- colnames(dat)[1:50]
groupB <- colnames(dat)[51:100]
results <- calculate_emd(dat, groupA, groupB, nperm=10)
head(results$emd)
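
# A rough base-R sketch of the idea behind the per-gene score. This is NOT
# the code used inside the package: the helper name emd_1d_sketch is
# hypothetical, and it assumes equal-width bins and unit-mass histograms,
# so its scale may differ from the scores returned by calculate_emd().
# In one dimension, the "work" needed to turn one histogram into the other
# is the summed absolute difference of the cumulative masses, multiplied
# by the bin width.
emd_1d_sketch <- function(a, b, binSize = 0.2) {
  lo <- min(a, b)
  nbins <- floor((max(a, b) - lo) / binSize) + 1
  # fraction of each group falling into each bin
  massA <- tabulate(floor((a - lo) / binSize) + 1, nbins) / length(a)
  massB <- tabulate(floor((b - lo) / binSize) + 1, nbins) / length(b)
  # mass carried across each bin boundary, times the distance moved
  sum(abs(cumsum(massA - massB))) * binSize
}

# Two groups with similar means but different shapes: the second is bimodal,
# the kind of heterogeneity that summary statistics tend to miss
set.seed(1)
unimodal <- rnorm(50)
bimodal <- c(rnorm(25, mean = -2), rnorm(25, mean = 2))
emd_1d_sketch(unimodal, bimodal)

# Smallest q-value that can be reported with the default nperm = 100,
# following the 1/(nperm+1) adjustment described above
1 / (100 + 1)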
}
\seealso{
\code{\link{EMDomics}} \code{\link[emdist]{emd2d}}
}

