\name{tokenize}
\alias{tokenize}
\title{A simple tokenizer}
\usage{
  tokenize(txt, format = "file", fileEncoding = NULL,
    split = "[[:space:]]", ign.comp = "-",
    heuristics = "abbr",
    heur.fix = list(pre = c("’", "'"), suf = c("’", "'")),
    abbrev = NULL, tag = TRUE, lang = "kRp.env",
    sentc.end = c(".", "!", "?", ";", ":"),
    detect = c(parag = FALSE, hline = FALSE),
    clean.raw = NULL, perl = FALSE, stopwords = NULL,
    stemmer = NULL)
}
\arguments{
  \item{txt}{Either an open connection, the path to a
  directory with txt files to read and tokenize, or a
  vector object already holding the text corpus.}

  \item{format}{Either "file" or "obj", depending on
  whether you want to scan files or analyze the given
  object.}

  \item{fileEncoding}{A character string naming the
  encoding of all files.}

  \item{split}{A regular expression to define the basic
  split method. Should only need refinement for languages
  that don't separate words by space.}

  \item{ign.comp}{A character vector defining punctuation
  which might be used in composita that should not be
  split.}

  \item{heuristics}{A vector to indicate if the tokenizer
  should use some heuristics. Can be none, one or several
  of the following: \itemize{ \item{\code{"abbr"}}{Assume
  that "letter-dot-letter-dot" combinations are
  abbreviations and leave them intact.}
  \item{\code{"suf"}}{Try to detect possessive suffixes like
  "'s", or contracted suffixes like "'ll", and treat them as
  one token.} \item{\code{"pre"}}{Try to detect prefixes
  like "s'" or "l'" and treat them as one token.} } Earlier
  releases used the names \code{"en"} and \code{"fr"}
  instead of \code{"suf"} and \code{"pre"}. These still
  work: \code{"en"} is equivalent to
  \code{"suf"}, whereas \code{"fr"} is now equivalent to
  both \code{"suf"} and \code{"pre"} (and not only
  \code{"pre"} as in the past, which missed the use of
  suffixes in French).}

  \item{heur.fix}{A list with the named vectors \code{pre}
  and \code{suf}. These will be used if \code{heuristics}
  is set to use one of the presets that try to detect
  prefixes and/or suffixes. Change them if your document uses
  other characters than the ones defined by default.}

  \item{abbrev}{Path to a text file with abbreviations to
  take care of, one per line. Note that this file must have
  the same encoding as defined by \code{fileEncoding}.}

  \item{tag}{Logical. If \code{TRUE}, the text will be
  rudimentarily tagged and returned as an object of class
  \code{kRp.tagged}.}

  \item{lang}{A character string naming the language of the
  analyzed text. If set to \code{"kRp.env"}, it is taken
  from \code{\link[koRpus:get.kRp.env]{get.kRp.env}}. Only
  needed if \code{tag=TRUE}.}

  \item{sentc.end}{A character vector with tokens
  indicating a sentence ending. Only needed if
  \code{tag=TRUE}.}

  \item{detect}{A named logical vector, indicating by the
  setting of \code{parag} and \code{hline} whether
  \code{tokenize} should try to detect paragraphs and
  headlines.}

  \item{clean.raw}{A named list of character values,
  indicating replacements that should globally be made to
  the text prior to tokenizing it.  This is applied after
  the text was converted into UTF-8 internally. In the
  list, the name of each element represents a pattern which
  is replaced by its value wherever it occurs in the text.
  Since this is done by calling
  \code{\link[base:gsub]{gsub}}, regular expressions are
  supported. See also the \code{perl} argument.}

  \item{perl}{Logical, only relevant if \code{clean.raw} is
  not \code{NULL}. If \code{perl=TRUE}, this is forwarded
  to \code{\link[base:gsub]{gsub}} to allow for perl-like
  regular expressions in \code{clean.raw}.}

  \item{stopwords}{A character vector to be used for
  stopword detection. Comparison is done in lower case. You
  can also simply set \code{stopwords=tm::stopwords("en")}
  to use the English stopwords provided by the \code{tm}
  package.}

  \item{stemmer}{A function or method to perform stemming.
  For instance, you can set
  \code{stemmer=Snowball::SnowballStemmer} if you have the
  \code{Snowball} package installed (or
  \code{SnowballC::wordStem}). As of now, you cannot
  provide further arguments to this function.}
}
\value{
  If \code{tag=FALSE}, a character vector with the
  tokenized text. If \code{tag=TRUE}, returns an object of
  class \code{\link[koRpus]{kRp.tagged-class}}.
}
\description{
  This tokenizer can be used as a replacement for
  TreeTagger. Its results are not as detailed when it comes
  to word classes, and no lemmatization is done. However,
  for most cases this should suffice.
}
\details{
  \code{tokenize} can try to guess what's a headline and
  where a paragraph was inserted (via the \code{detect}
  parameter). A headline is assumed if a line of text
  without sentence ending punctuation is found, a paragraph
  if two blocks of text are separated by space. This will
  add extra tags into the text: "<kRp.h>" (headline
  starts), "</kRp.h>" (headline ends) and "<kRp.p/>"
  (paragraph), respectively. This can be useful in two
  ways: "</kRp.h>" will be treated like a sentence ending,
  which gives you more control for automatic analyses. In
  addition,
  \code{\link[koRpus:kRp.text.paste]{kRp.text.paste}} can
  replace these tags, which probably preserves more of the
  original layout.
}
\examples{
\dontrun{
tokenized.obj <- tokenize("~/mydata/corpora/russian_corpus/")

## character manipulation
# this is useful if you know of problematic characters in your
# raw text files, but don't want to touch the files directly.
# you don't have to, as you can substitute the characters, even
# using regular expressions. a simple example: replace all
# single quotes by double quotes throughout the text:
tokenized.obj <- tokenize("~/my.data/speech.txt",
   clean.raw=list("'"="\\""))
# now replace all occurrences of the letter A followed
# by two digits with the letter B, followed by the same
# two digits:
tokenized.obj <- tokenize("~/my.data/speech.txt",
   clean.raw=list("(A)([[:digit:]]{2})"="B\\\\2"),
   perl=TRUE)
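
# a rough sketch of the mechanism in base R, assuming each element
# of clean.raw is applied in turn as gsub(name, value, text). this
# is an illustration only, not the actual koRpus code, and the
# names "raw.line" and "replacements" are made up:
raw.line <- "It's  'quoted'   text."
replacements <- list("'"='"', "[[:space:]]+"=" ")
for (pattern in names(replacements)) {
   raw.line <- gsub(pattern, replacements[[pattern]], raw.line)
}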

## enabling stopword detection and stemming
# if you also installed the packages tm and Snowball,
# you can use some of their features with koRpus:
tokenized.obj <- tokenize("~/my.data/speech.txt",
   stopwords=tm::stopwords("en"),
   stemmer=Snowball::SnowballStemmer)
# alternatively, use the SnowballC package:
tokenized.obj <- tokenize("~/my.data/speech.txt",
   stopwords=tm::stopwords("en"),
   stemmer=SnowballC::wordStem)

# removing all stopwords now is simple:
tokenized.noStopWords <- kRp.filter.wclass(tokenized.obj, "stopword")
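
## tokenizing text that is already held in an R object
# a minimal sketch using only arguments documented above. the
# object name "sample.text" and its contents are made up, and
# lang="en" assumes that English is supported by your setup:
sample.text <- c("A short headline",
   "",
   "Here's the first sentence. The U.S. example follows!")
tokenized.obj <- tokenize(sample.text, format="obj", lang="en",
   heuristics=c("abbr", "suf"),
   detect=c(parag=TRUE, hline=TRUE))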
}
}
\keyword{misc}

