% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/training_data.R
\name{merge.chunkrange}
\alias{merge.chunkrange}
\title{CRF Training data construction: add chunk entity category to a tokenised dataset}
\usage{
\method{merge}{chunkrange}(x, y, by.x = "doc_id", by.y = "doc_id", default_entity = "O", ...)
}
\arguments{
\item{x}{an object of class \code{chunkrange}. A \code{chunkrange} is a data.frame which contains
one row per chunk/doc_id. It should have the columns \code{doc_id}, \code{text}, \code{chunk_id}, \code{chunk_entity}, \code{start} and \code{end}.\cr
The fields \code{start} and \code{end} indicate where in the original \code{text} the chunk of words starts and where it ends.
The \code{chunk_entity} is the label you have assigned to the chunk (e.g. ORGANISATION / LOCATION / MONEY / LABELXYZ / ...).}

\item{y}{a tokenised data.frame containing one row per doc_id/token. It should have the columns \code{doc_id}, \code{start} and \code{end}, where
the fields \code{start} and \code{end} indicate the positions in the original text of the \code{doc_id} where the token starts and where it ends.
See the examples.}

\item{by.x}{a character string with the name of the column of \code{x} that identifies the sequence (the document). Defaults to 'doc_id'.}

\item{by.y}{a character string with the name of the column of \code{y} that identifies the sequence (the document). Defaults to 'doc_id'.}

\item{default_entity}{character string with the default \code{chunk_entity} to be assigned to the token if the token is not part of any chunk range.
Defaults to 'O'.}

\item{...}{not used}
}
\value{
the data.frame \code{y} where 2 columns are added, namely:
\itemize{
 \item{chunk_entity: The chunk entity of the token if the token is inside a chunk defined in \code{x}. If the token is not part of any chunk, the chunk entity is set to the value of \code{default_entity}.}
 \item{chunk_id: The identifier of the chunk that contains the token.}
}
}
\description{
Chunks annotated with the shiny app in this R package indicate, for a chunk of text of a document,
the entity that it belongs to. As text chunks can contain several words, we need a way
to add this chunk category to each word of a tokenised dataset. That is what this function does.\cr
If you have a tokenised data.frame with one row per token/document which indicates the start and end position
where the token is found in the text of the document, this function assigns the chunk label to each token
of the document.
}
\examples{
\donttest{
library(udpipe)
udmodel <- udpipe_download_model("dutch-lassysmall")
if(packageVersion("udpipe") >= "0.7"){
  data(airbnb_chunks, package = "crfsuite")
  airbnb_chunks <- head(airbnb_chunks, 20)
  airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])

  airbnb_tokens <- udpipe(airbnb_tokens, object = udmodel)
  head(airbnb_tokens)
  head(airbnb_chunks)

  ## Add the entity of the chunk to the tokenised dataset
  x <- merge(airbnb_chunks, airbnb_tokens)
  table(x$chunk_entity)
}
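
## Illustration only: a base R sketch of the idea behind the merge above,
## not the implementation used by this package.
## A token gets the label of the chunk whose character range contains it.
## Assumptions: the toy_* objects are made-up data, positions are taken to be
## 1-based and inclusive, and tokens outside any chunk get a missing chunk_id.
toy_chunks <- data.frame(doc_id = "doc1", chunk_id = 1L, chunk_entity = "LOCATION",
                         start = 11L, end = 18L, stringsAsFactors = FALSE)
## Tokens of the text "I live in New York" with their character positions
toy_tokens <- data.frame(doc_id = "doc1",
                         token = c("I", "live", "in", "New", "York"),
                         start = c(1L, 3L, 8L, 11L, 15L),
                         end   = c(1L, 6L, 9L, 13L, 18L),
                         stringsAsFactors = FALSE)
## Start from the default entity, then overwrite the tokens which fall inside a chunk
toy_tokens$chunk_entity <- "O"
toy_tokens$chunk_id <- NA_integer_
for(i in seq_len(nrow(toy_chunks))){
  inside <- toy_tokens$doc_id == toy_chunks$doc_id[i] &
    toy_tokens$start >= toy_chunks$start[i] &
    toy_tokens$end <= toy_chunks$end[i]
  toy_tokens$chunk_entity[inside] <- toy_chunks$chunk_entity[i]
  toy_tokens$chunk_id[inside] <- toy_chunks$chunk_id[i]
}
toy_tokens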

## cleanup for CRAN
file.remove(udmodel$file_model)
}
}
