\name{readCorpus}
\alias{readCorpus}
\title{
Read in a corpus file
}
\description{
Converts pre-processed document matrices stored in popular formats to the format used by stm.
}
\usage{
readCorpus(corpus, type = c("dtm", "ldac", "slam", "Matrix", "txtorgvocab"))
}
\arguments{
  \item{corpus}{
  An input object or file path to be processed. What is expected depends on \code{type}; see Details.
}
  \item{type}{
  The type of input. Several formats are supported; see Details.
}
}
\details{
This function provides a simple utility for converting other document formats to our own.  Briefly: \code{dtm} takes a standard matrix as input and converts it to our format.  \code{ldac} takes a file path and reads in documents stored in the sparse format popularized by David Blei's C implementation of LDA.  \code{slam} converts from the \code{simple_triplet_matrix} representation used by the \code{slam} package. This is also the representation of corpora in the popular \code{tm} package and should work in those cases.

\code{dtm} expects a matrix object where each row represents a document and each column represents a word in the dictionary.

\code{ldac} expects a file name or path that contains a file in Blei's LDA-C format. From his ReadMe: 
"The data is a file where each line is of the form:

\preformatted{[M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]}

where [M] is the number of unique terms in the document, and the
[count] associated with each term is how many times that term appeared
in the document.  Note that [term_1] is an integer which indexes the
term; it is not a string."  

Because R indexes from one, the values of the term indices are incremented by one on import.
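
As a rough illustration of the format and the index shift, the following base R sketch parses a single LDA-C line. The helper \code{parse_ldac_line} is hypothetical and is not part of this package. It assumes zero-based term indices in the file and returns a two-row matrix (term indices over counts), which may differ in detail from what \code{readCorpus} returns.

\preformatted{
# Hypothetical helper for illustration only; not part of stm.
# Parses one LDA-C line into a two-row integer matrix:
# row 1 holds one-based term indices, row 2 holds counts.
parse_ldac_line <- function(line) {
  fields <- strsplit(trimws(line), " +")[[1]]
  n_terms <- as.integer(fields[1])             # the leading [M]
  pairs <- strsplit(fields[-1], ":", fixed = TRUE)
  idx <- as.integer(vapply(pairs, `[`, character(1), 1))
  cnt <- as.integer(vapply(pairs, `[`, character(1), 2))
  stopifnot(length(idx) == n_terms)            # [M] must match the pairs
  idx <- idx + 1L                              # zero-based to one-based
  rbind(idx, cnt, deparse.level = 0)
}

# Terms 0, 4 and 7 in the file become terms 1, 5 and 8 in R.
doc <- parse_ldac_line("3 0:2 4:1 7:5")
}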

\code{slam} expects a \code{\link[slam]{simple_triplet_matrix}} from that package.

\code{Matrix} attempts to coerce the matrix to a \code{\link[slam]{simple_triplet_matrix}} and convert using the functionality built for the \code{slam} package.  This will work for most applicable classes in the \code{Matrix} package such as \code{dgCMatrix}.

Finally, the \code{txtorgvocab} type allows the user to easily read in a vocab file generated by the software \code{txtorg}.  When working in English it is straightforward to read in files created by txtorg.  However, when working in other languages, particularly Chinese and Arabic, it can be difficult to read in the files using \code{\link{read.table}} or \code{\link{read.csv}}.  This function should work well in those circumstances.
}
\value{
\item{documents}{A documents object in our format}
\item{vocab}{A vocab object if information is available to construct one}
}

\seealso{
\code{\link{textProcessor}}, \code{\link{prepDocuments}}
}
\examples{
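# congress109Counts is a sparse document-term count matrix from textir
library(textir)
data(congress109)
# Convert the sparse matrix to the stm documents/vocab representation
out <- readCorpus(congress109Counts, type="Matrix")
documents <- out$documents
vocab <- out$vocab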
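
# A minimal sketch of the "dtm" type using a small hand-built count matrix.
# The matrix and its row and column names are made up for illustration.
# Rows are documents and columns are vocabulary terms, as described in
# Details. It is assumed here that the column names are used as the vocab.
toy <- matrix(c(1, 0, 2,
                1, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"),
                              c("apple", "banana", "cherry")))
out.toy <- readCorpus(toy, type = "dtm")
str(out.toy$documents)  # inspect the converted documents
out.toy$vocab           # inspect the vocab, if one was constructed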
}