% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/recombine.R
\name{epub_recombine}
\alias{epub_recombine}
\title{Recombine text sections}
\usage{
epub_recombine(data, pattern, sift = NULL)
}
\arguments{
\item{data}{a data frame created by \code{epub}.}

\item{pattern}{character, a regular expression.}

\item{sift}{\code{NULL} or a named list of parameters passed to \code{\link{epub_sift}}. See details.}
}
\value{
a data frame
}
\description{
Split and recombine EPUB text sections based on regular expression pattern matching.
}
\details{
This function takes a regular expression and uses it to determine new break points for the full e-book text.
This is particularly useful when the sections pulled from EPUB metadata break at arbitrary points, while the meaningful breaks in the text, such as chapter headings, fall at various locations within those sections.
\code{epub_recombine} collapses the text and then creates a new nested data frame in which new chapter/section labels, word counts and character counts
are associated with the text based on the new break points.

Usefulness depends on the quality of the e-book. This function exists to improve the data structure of content parsed from e-books with poor metadata formatting,
but the original text must still be formatted consistently enough for the operation to succeed. Specifically, new break points must be identifiable by a consistent, unambiguous regular expression pattern.
See examples below using the built-in e-book dataset.

When used in conjunction with \code{epub_sift} via the \code{sift} argument, recombining and resifting are done recursively.
This is because sifting can create a need to rerun the recombine step in order to regenerate proper chapter indexing for the section column.
However, recombining a second time does not lead to a need to resift, so recursion ends after one round regardless.

This is a convenient way to avoid the syntax:

\code{epub_recombine([args]) \%>\% epub_sift([args]) \%>\% epub_recombine([args])}.
}
\examples{
file <- system.file("dracula.epub", package = "epubr")
x <- epub(file) # parse entire e-book
x$data[[1]] # note arbitrary section breaks (not between chapters)

pat <- "CHAPTER [IVX]+" # but a reliable pattern exists for new breaks
epub_recombine(x, pat) # not quite as expected; pattern also appears in table of contents!
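
# Illustrative sketch using base R only. The two sample strings are made up
# for this example and are not taken from the e-book. The pattern matches a
# table of contents entry just as it matches a chapter heading, which is why
# the result above contains unwanted breaks.
toc_line <- "CHAPTER I. Arrival ... 1" # hypothetical contents entry
heading <- "CHAPTER I" # hypothetical chapter heading
grepl("CHAPTER [IVX]+", c(toc_line, heading)) # both TRUE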

epub_recombine(x, pat, sift = list(n = 1000)) # also sift low word-count sections
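
# For comparison, a sketch of the longer form that the sift argument is meant
# to replace (see Details). This assumes epub_sift takes the data frame as its
# first argument and accepts n as in the call above. It reuses x and pat from
# the lines above.
y <- epub_sift(epub_recombine(x, pat), n = 1000)
epub_recombine(y, pat)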
}
\seealso{
\code{\link{epub_sift}}
}
