% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vcfR2DNAbin.R
\name{vcfR2DNAbin}
\alias{vcfR2DNAbin}
\title{Convert vcfR to DNAbin}
\usage{
vcfR2DNAbin(
  x,
  extract.indels = TRUE,
  consensus = FALSE,
  extract.haps = TRUE,
  unphased_as_NA = TRUE,
  asterisk_as_del = FALSE,
  ref.seq = NULL,
  start.pos = NULL,
  verbose = TRUE
)
}
\arguments{
\item{x}{an object of class chromR or vcfR}

\item{extract.indels}{logical indicating to remove indels (TRUE) or to include them while retaining alignment}

\item{consensus}{logical, indicates whether an IUPAC ambiguity code should be used for diploid heterozygotes}

\item{extract.haps}{logical specifying whether to separate each genotype into alleles based on a delimiting character}

\item{unphased_as_NA}{logical indicating how to handle alleles in unphased genotypes}

\item{asterisk_as_del}{logical indicating that the asterisk allele should be converted to a deletion (TRUE) or NA (FALSE)}

\item{ref.seq}{reference sequence (DNAbin) for the region being converted}

\item{start.pos}{chromosomal position for the start of the ref.seq}

\item{verbose}{logical specifying whether to produce verbose output}
}
\description{
Convert objects of class vcfR to objects of class ape::DNAbin
}
\details{
Objects of class \strong{DNAbin}, from the package ape, store nucleotide sequence information.
Typically, nucleotide sequence information contains all the nucleotides within a region, for example, a gene.
Because most sites are typically invariant, this results in a large amount of redundant data.
This is why files in the vcf format only contain information on variant sites, it results in a smaller file.
Nucleotide sequences can be generated which only contain variant sites.
However, some applications require the invariant sites.
For example, inference of phylogeny based on maximum likelihood or Bayesian methods requires invariant sites.
The function vcfR2DNAbin therefore includes a number of options in attempt to accomodate various scenarios.


The presence of indels (insertions or deletions)in a sequence typically presents a data analysis problem.
Mutation models typically do not accomodate this data well.
For now, the only option is for indels to be omitted from the conversion of vcfR to DNAbin objects.
The option \strong{extract.indels} was included to remind us of this, and to provide a placeholder in case we wish to address this in the future.


The \strong{ploidy} of the samples is inferred from the first non-missing genotype.
All samples and all variants within each sample are assumed to be of the same ploid.


Conversion of \strong{haploid data} is fairly straight forward.
The options \code{consensus} and \code{extract.haps} are not relevant here.
When vcfR2DNAbin encounters missing data in the vcf data (NA) it is coded as an ambiguous nucleotide (n) in the DNAbin object.
When no reference sequence is provided (option \code{ref.seq}), a DNAbin object consisting only of variant sites is created.
When a reference sequence and a starting position are provided the entire sequence, including invariant sites, is returned.
The reference sequence is used as a starting point and variable sitees are added to this.
Because the data in the vcfR object will be using a chromosomal coordinate system, we need to tell the function where on this chromosome the reference sequence begins.


Conversion of \strong{diploid data} presents a number of scenarios.
When the option \code{consensus} is TRUE and \code{extract.haps} is FALSE, each genotype is split into two alleles and the two alleles are converted into their IUPAC ambiguity code.
This results in one sequence for each diploid sample.
This may be an appropriate path when you have unphased data.
Note that functions called downstream of this choice may handle IUPAC ambiguity codes in unexpected manners.
When extract.haps is set to TRUE, each genotype is split into two alleles.
These alleles are inserted into two sequences.
This results in two sequences per diploid sample.
Note that this really only makes sense if you have phased data.
The options ref.seq and start.pos are used as in halpoid data.


When a variant overlaps a deletion it may be encoded by an \strong{asterisk allele (*)}.
The GATK site covers this in a post on \href{https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele}{Spanning or overlapping deletions} ].
This is handled in vcfR by allowing the user to decide how it is handled with the paramenter \code{asterisk_as_del}.
When \code{asterisk_as_del} is TRUE this allele is converted into a deletion ('-').
When \code{asterisk_as_del} is FALSE the asterisk allele is converted to NA.
If \code{extract.indels} is set to FALSE it should override this decision.


Conversion of \strong{polyploid data} is currently not supported.
However, I have made some attempts at accomodating polyploid data.
If you have polyploid data and are interested in giving this a try, feel free.
But be prepared to scrutinize the output to make sure it appears reasonable.


Creation of DNAbin objects from large chromosomal regions may result in objects which occupy large amounts of memory.
If in doubt, begin by subsetting your data and the scale up to ensure you do not run out of memory.
}
\examples{
library(ape)
data(vcfR_test)

# Create an example reference sequence.
nucs <- c('a','c','g','t')
set.seed(9)
myRef <- as.DNAbin(matrix(nucs[round(runif(n=20, min=0.5, max=4.5))], nrow=1))

# Recode the POS data for a smaller example.
set.seed(99)
vcfR_test@fix[,'POS'] <- sort(sample(10:20, size=length(getPOS(vcfR_test))))

# Just vcfR
myDNA <- vcfR2DNAbin(vcfR_test)
seg.sites(myDNA)
image(myDNA)

# ref.seq, no start.pos
myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef)
seg.sites(myDNA)
image(myDNA)

# ref.seq, start.pos = 4.
# Note that this is the same as the previous example but the variants are shifted.
myDNA <- vcfR2DNAbin(vcfR_test, ref.seq = myRef, start.pos = 4)
seg.sites(myDNA)
image(myDNA)

# ref.seq, no start.pos, unphased_as_NA = FALSE
myDNA <- vcfR2DNAbin(vcfR_test, unphased_as_NA = FALSE, ref.seq = myRef)
seg.sites(myDNA)
image(myDNA)



}
\seealso{
\href{https://cran.r-project.org/package=ape}{ape}
}
