% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/perturbation-clustering.R
\name{PerturbationClustering}
\alias{PerturbationClustering}
\title{Perturbation clustering}
\usage{
PerturbationClustering(
  data,
  kMax = 5,
  verbose = T,
  ncore = 1,
  clusteringMethod = "kmeans",
  clusteringFunction = NULL,
  clusteringOptions = NULL,
  perturbMethod = "noise",
  perturbFunction = NULL,
  perturbOptions = NULL,
  PCAFunction = NULL,
  iterMin = 20,
  iterMax = 200,
  madMin = 0.001,
  msdMin = 1e-06
)
}
\arguments{
\item{data}{Input matrix. The rows represent items while the columns represent features.}

\item{kMax}{The maximum number of clusters. The algorithm runs from \code{k = 2} to \code{k = kMax}. Default value is \code{5}.}

\item{verbose}{Logical value indicating whether the algorithm runs with logging. Default value is \code{TRUE}.}

\item{ncore}{Number of cores that the algorithm should use. Default value is \code{1}.}

\item{clusteringMethod}{The name of the built-in clustering algorithm that PerturbationClustering will use. Currently supported algorithms are \code{kmeans}, \code{pam} and \code{hclust}. Default value is \code{"kmeans"}.}

\item{clusteringFunction}{The clustering algorithm function that will be used instead of built-in algorithms.}

\item{clusteringOptions}{A list of parameters that will be passed to the clustering algorithm named in \code{clusteringMethod}.}

\item{perturbMethod}{The name of the built-in perturbation method that PerturbationClustering will use. Currently supported methods are \code{noise} and \code{subsampling}. Default value is \code{"noise"}.}

\item{perturbFunction}{The perturbation method function that will be used instead of built-in ones.}

\item{perturbOptions}{A list of parameters that will be passed to the perturbation method named in \code{perturbMethod}.}

\item{PCAFunction}{A customized PCA function defined by the user.}

\item{iterMin}{The minimum number of iterations. Default value is \code{20}.}

\item{iterMax}{The maximum number of iterations. Default value is \code{200}.}

\item{madMin}{The minimum mean absolute deviation of the \code{AUC} of the connectivity matrix for each \code{k}. Default value is \code{1e-03}.}

\item{msdMin}{The minimum mean square deviation of the \code{AUC} of the connectivity matrix for each \code{k}. Default value is \code{1e-06}.}
}
\value{
\code{PerturbationClustering} returns a list with at least the following components:
\item{k}{The optimal number of clusters}
\item{cluster}{A vector of labels indicating the cluster to which each sample is allocated}
\item{origS}{A list of original connectivity matrices}
\item{pertS}{A list of perturbed connectivity matrices}
}
\description{
Performs subtyping using a single type of high-dimensional data.
}
\details{
PerturbationClustering implements the Perturbation Clustering algorithm of Nguyen et al. (2017).
It aims to determine the optimal number of clusters and the cluster membership of each sample in an unsupervised analysis.

PerturbationClustering takes as input a numeric matrix or data frame with items as rows and features as columns.
It uses a clustering algorithm as the base algorithm.
The built-in algorithms that users can use directly are \code{kmeans}, \code{pam} and \code{hclust}.
The default parameters for the built-in \code{kmeans} are \code{nstart = 20} and \code{iter.max = 1000}.
Users can change the parameters of the built-in clustering algorithm by passing the values into \code{clusteringOptions}.

PerturbationClustering also allows users to supply their own clustering algorithm instead of a built-in one through the \code{clusteringFunction} parameter.
Once \code{clusteringFunction} is specified, \code{clusteringMethod} is ignored.
The value of \code{clusteringFunction} must be a function that takes two arguments, \code{data} and \code{k},
where \code{data} is a numeric matrix or data frame containing the data that need to be clustered, and \code{k} is the number of clusters.
\code{clusteringFunction} must return a vector of labels indicating the cluster to which each sample is allocated.

PerturbationClustering uses a perturbation method to perturb the clustering input data.
There are two built-in methods, \code{noise} and \code{subsampling}, which users can select through the \code{perturbMethod} parameter.
Users can change the default values of the built-in perturbation methods by passing new values into \code{perturbOptions}:

1. The \code{noise} perturbation method takes two arguments: \code{noise} and \code{noisePercent}. The default values are \code{noise = NULL} and \code{noisePercent = "median"}.
If \code{noise} is specified, \code{noisePercent} is ignored.\cr
2. The \code{subsampling} perturbation method takes one argument, \code{percent}, which has a default value of \code{80}.

Users can also use their own perturbation methods by passing them into \code{perturbFunction}.
Once \code{perturbFunction} is specified, \code{perturbMethod} is ignored.
The value of \code{perturbFunction} must be a function that takes one argument, \code{data},
a numeric matrix or data frame containing the data that need to be perturbed.
\code{perturbFunction} must return a list with the following components:

1. \code{data}: the perturbed data.\cr
2. \code{ConnectivityMatrixHandler}: a function that takes three arguments:
\code{connectivityMatrix}, the connectivity matrix generated by clustering the returned \code{data};
\code{iter}, the current iteration; and \code{k}, the number of clusters.
This function corrects \code{connectivityMatrix} if needed and returns the corrected version,
which must be compatible with the original connectivity matrix.\cr
3. \code{MergeConnectivityMatrices}: a function that takes four arguments: \code{oldMatrix}, \code{newMatrix}, \code{k} and \code{iter}.
\code{oldMatrix} and \code{newMatrix} are the two connectivity matrices to be merged,
\code{k} is the number of clusters and \code{iter} is the current iteration number.
This function must return a connectivity matrix merged from \code{oldMatrix} and \code{newMatrix}.
}
\examples{
\donttest{
# Load the dataset AML2004
data(AML2004)
data <- as.matrix(AML2004$Gene)
# Perform the clustering
result <- PerturbationClustering(data = data)
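
# Inspect the result: a minimal sketch that relies only on the components
# listed in the Value section and on the AML2004$Group labels used below.
result$k                                    # estimated number of clusters
table(result$cluster)                       # cluster sizes
table(result$cluster, AML2004$Group[, 2])   # clusters versus known groups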

# Plot the result
condition <- seq_along(unique(AML2004$Group[, 2]))
names(condition) <- unique(AML2004$Group[, 2])
plot(
    prcomp(data)$x,
    col = result$cluster,
    pch = condition[AML2004$Group[, 2]],
    main = "AML2004"
)
legend(
    "bottomright",
    legend = paste("Cluster ", sort(unique(result$cluster)), sep = ""),
    fill = sort(unique(result$cluster))
)
legend("bottomleft", legend = names(condition), pch = condition)

# Change kmeans parameters
result <- PerturbationClustering(
    data = data,
    clusteringMethod = "kmeans",
    clusteringOptions = list(
        iter.max = 500,
        nstart = 50
    )
)

# Change to use pam
result <- PerturbationClustering(data = data, clusteringMethod = "pam")

# Change to use hclust
result <- PerturbationClustering(data = data, clusteringMethod = "hclust")

# Pass a user-defined clustering algorithm
result <- PerturbationClustering(data = data, clusteringFunction = function(data, k){
    # this function must return a vector of cluster labels, one per row of data
    kmeans(x = data, centers = k, nstart = k*10, iter.max = 2000)$cluster
})
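
# Another user-defined clustering function: a sketch assuming Ward-linkage
# hierarchical clustering from the stats package. The linkage choice and the
# name wardClustering are illustrative and not part of the package.
wardClustering <- function(data, k) {
    # cutree returns one integer cluster label per row of data
    cutree(hclust(dist(data), method = "ward.D2"), k = k)
}
result <- PerturbationClustering(data = data, clusteringFunction = wardClustering)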

# Use noise as the perturb method
result <- PerturbationClustering(data = data, 
                                 perturbMethod = "noise", 
                                 perturbOptions = list(noise = 0.3))
# or
result <- PerturbationClustering(data = data, 
                                 perturbMethod = "noise", 
                                 perturbOptions = list(noisePercent = 10))

# Change to use subsampling
result <- PerturbationClustering(data = data, 
                                 perturbMethod = "subsampling", 
                                 perturbOptions = list(percent = 90))
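
# Adjust the search range and the iteration limits: a sketch using only the
# documented arguments. The values below are arbitrary illustrations.
result <- PerturbationClustering(
    data = data,
    kMax = 8,
    iterMin = 50,
    iterMax = 500,
    verbose = FALSE
)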

# Users can pass their own perturb method
result <- PerturbationClustering(data = data, perturbFunction = function(data){
   rowNum <- nrow(data)
   colNum <- ncol(data)
   epsilon <-
       matrix(
           data = rnorm(rowNum * colNum, mean = 0, sd = 1.234),
           nrow = rowNum,
           ncol = colNum
       )
   
   list(
       data = data + epsilon,
       ConnectivityMatrixHandler = function(connectivityMatrix, ...) {
           connectivityMatrix
       },
       MergeConnectivityMatrices = function(oldMatrix, newMatrix, iter, ...){
           return((oldMatrix*(iter-1) + newMatrix)/iter)
       }
   )
})
}
}
\references{
1. T. Nguyen, R. Tagett, D. Diaz, S. Draghici. A novel method for data integration and disease subtyping. Genome Research, 27(12):2025-2039, 2017.

2. T. Nguyen. Horizontal and vertical integration of bio-molecular data. PhD thesis, Wayne State University, 2017.
}
\seealso{
\code{\link{kmeans}}, \code{\link{pam}}, \code{\link{hclust}}
}
