% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/matching.R
\name{auto_match_seqs}
\alias{auto_match_seqs}
\title{Build a template table with automatically matched sequence names}
\usage{
auto_match_seqs(x, method = "lv", xlsx)
}
\arguments{
\item{x}{A table (data frame or tibble) typically produced by
\code{\link{concatipede_prepare}}. It must be of the same format as a
table returned by this function: a first column called "name" followed
by one column per fasta file. Those columns have the name of their
corresponding fasta file, and they contain the names of the sequences in
this file, with one sequence name per cell. The number of rows is the
number of sequences in the fasta file with the most sequences, and the
columns for the other fasta files are padded with \code{NA}.}

\item{method}{Method for string distance calculation. See
\code{help("stringdist-metrics", package = "stringdist")} for details.
Default is \code{"lv"} (Levenshtein distance).}

\item{xlsx}{Optional path to which the output table is saved as an Excel
file.}
}
\value{
A table (tibble) with the same columns as \code{x} and with sequence
names automatically matched across fasta files. Sequence names that did
not have a reciprocal best match in other fasta files are appended to
the end of the table, so that each column of the output table contains all
the unique sequence names present in the corresponding column of the input
table. The first column, "name", contains a suggested name for the row
(not guaranteed to be unique). If a path is provided to the \code{xlsx}
argument, an Excel file is saved and the table is returned invisibly.
}
\description{
Match sequence names across fasta files automatically, based on the string
distance between names. The matching algorithm is outlined in the Details
section.
}
\details{
Assume there are N fasta files, with each fasta file i containing n_i
sequence names. The problem of matching the names in the best possible
way across the fasta files is similar to that of identifying homologous
proteins across species, for example with reciprocal BLAST.

The algorithm steps are:
\itemize{

\item For each pair of fasta files, identify matching names using a
reciprocal best match approach: two names match if and only if each one is
the best match of the other.

\item These matches across fasta files define a graph, in which the nodes
are sequence names and the edges connect reciprocal best matches.

\item Identify sub-graphs such that (i) they contain at most one
sequence name per fasta file and (ii) all nodes in a given sub-graph are
fully connected (i.e., any two names in the sub-graph are reciprocal best
matches for their pair of fasta files).

}
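
The pairwise step above can be sketched in a few lines of R. This is only an
illustration for two fasta files, not necessarily the code used internally:
the sequence names are made up, and \code{stringdist::stringdistmatrix} is
called directly to compute the distances.

\preformatted{
# Hypothetical sequence names from two fasta files
a <- c("Homo_sapiens_COI", "Mus_musculus_COI", "Danio_rerio_COI")
b <- c("Danio_rerio_16S", "Homo_sapiens_16S", "Gallus_gallus_16S")

# Distance matrix: one row per name in a, one column per name in b
d <- stringdist::stringdistmatrix(a, b, method = "lv")

# Best match in b for each name in a, and in a for each name in b
# (ties are resolved by taking the first minimum)
best_b_for_a <- apply(d, 1, which.min)
best_a_for_b <- apply(d, 2, which.min)

# Keep a pair only if each name is the best match of the other
reciprocal <- best_a_for_b[best_b_for_a] == seq_along(a)
matches <- data.frame(a = a[reciprocal],
                      b = b[best_b_for_a[reciprocal]])
matches
}

Repeating this for every pair of fasta files gives the edges of the graph
described in the second step.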
}
\examples{
xlsx_file <- concatipede_example("sequences-test-matching.xlsx")
xlsx_template <- readxl::read_xlsx(xlsx_file)
auto_match_seqs(xlsx_template)
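# A variation, assuming that `method` is passed on to the stringdist package:
# any metric listed in its "stringdist-metrics" help page should then be
# accepted, for example the Jaro-Winkler distance ("jw").
auto_match_seqs(xlsx_template, method = "jw")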
\dontrun{
  auto_match_seqs(xlsx_template, xlsx = "my-automatic-output.xlsx")
}

}
