% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/data2haplohh.R
\name{data2haplohh}
\alias{data2haplohh}
\title{Convert data from input file to an object of class haplohh}
\usage{
data2haplohh(hap_file, map_file = NA, min_perc_geno.hap = NA,
  min_perc_geno.mrk = 100, min_maf = NA, chr.name = NA,
  popsel = NA, recode.allele = FALSE, allele_coding = "12",
  haplotype.in.columns = FALSE, remove_multiple_markers = FALSE,
  polarize_vcf = TRUE, capitalize_AA = TRUE,
  position_scaling_factor = NA, verbose = TRUE)
}
\arguments{
\item{hap_file}{file containing haplotype data (see details below).}

\item{map_file}{file containing map information (see details below).}

\item{min_perc_geno.hap}{minimum percentage of genotyped markers required per haplotype
(haplotypes with less than \code{min_perc_geno.hap} percent of markers genotyped are discarded). Default is \code{NA},
hence no constraint.}

\item{min_perc_geno.mrk}{minimum percentage of genotyped haplotypes required per marker (markers genotyped on less than
\code{min_perc_geno.mrk} percent of haplotypes are discarded). By default, \code{min_perc_geno.mrk=100},
hence only fully genotyped markers are retained.
This value cannot be set to \code{NA} or zero.}

\item{min_maf}{threshold on the Minor Allele Frequency. Markers having a MAF lower than or equal to \code{min_maf} are discarded.
For multi-allelic markers, the second-most frequent allele is regarded as the minor allele.
Setting this value to zero eliminates monomorphic sites. Default is \code{NA},
hence no constraint.}

\item{chr.name}{name of the chromosome considered (relevant if data for several chromosomes is
contained in the haplotype or map file).}

\item{popsel}{code of the population considered (relevant for fastPHASE output which
can contain haplotypes from various populations).}

\item{recode.allele}{\emph{Deprecated}. logical. \code{FALSE} by default. \code{TRUE} forces parameter \code{allele_coding} to \code{"map"},
\code{FALSE} leaves it unchanged.}

\item{allele_coding}{the allele coding provided by the user. Either \code{"12"} (default), \code{"01"}, \code{"map"} or \code{"none"}.
The option is irrelevant for vcf files and ms output.}

\item{haplotype.in.columns}{logical. If \code{TRUE}, phased input haplotypes are assumed to be in columns (as produced
by the SHAPEIT2 program; O'Connell et al., 2014).}

\item{remove_multiple_markers}{logical. If \code{FALSE} (default), conversion
stops if multiple markers with the same chromosomal position are encountered.
If \code{TRUE}, duplicated markers are removed (all but the first marker with identical positions).}

\item{polarize_vcf}{logical. Only of relevance for vcf files. If \code{TRUE} (default), tries to polarize
variants with the help of the AA entry in the INFO field. Unpolarized alleles are discarded.
If \code{FALSE}, the allele coding of the vcf file is used unchanged as internal coding.}

\item{capitalize_AA}{logical. Only of relevance for vcf files with ancestral allele information.
Low confidence ancestral alleles are usually coded by lower-case letters. If \code{TRUE} (default), these are
changed to upper case before the alleles of the sample are matched for polarization.}

\item{position_scaling_factor}{intended primarily for output of ms where
positions lie in the interval [0,1]. These can be rescaled to sizes
of typical markers in real data.}

\item{verbose}{logical. If \code{TRUE} (default), report verbose progress.}
}
\value{
The returned value is an object of \code{\link{haplohh-class}}.
}
\description{
Convert input data files to an object of \code{\link{haplohh-class}}.
}
\details{
Five haplotype input formats are supported:
\itemize{
\item a "standard format" with haplotypes in rows and markers in columns (with no header, but a haplotype ID/name in
the first column).
\item a "transposed format" similar to the one produced by the phasing program SHAPEIT2
(O'Connell et al., 2014) in which haplotypes are in columns and markers in rows
(with neither header nor marker IDs nor haplotype IDs).
\item output files from the fastPHASE program (Scheet and Stephens, 2006).
If haplotypes from several different populations were phased simultaneously (-u fastPHASE option
was used), it is necessary to specify the population of interest by parameter \code{popsel}
(if this parameter is not set or is set wrongly, the error message will provide a list of
the population numbers contained in the file).
\item files in variant call format (vcf). No map file is needed in this case. If
the file contains several chromosomes, it is necessary to choose one by parameter
\code{chr.name}.
\item output of the simulation program 'ms'. No map file is needed in this case. If the file
contains several 'runs', a specific run number has to be selected by the
parameter \code{chr.name}.
}
The "transposed format" has to be explicitly set while the other formats
are recognized automatically.

The map file contains marker information in three, or, if it is used for
polarization (see below), five columns:
\itemize{
\item marker name/id
\item chromosome
\item position (physical or genetic)
\item ancestral allele encoding
\item derived allele encoding
}
The markers must be in the same order as in the haplotype file. If
several chromosomes are represented in the map file, the one
corresponding to the haplotype file must be selected by parameter \code{chr.name}.
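
As an illustration only, a five-column map file could be written from R as
sketched below. The marker names, positions and alleles are hypothetical, the
file is written to a temporary path, and the sketch assumes whitespace-separated
columns without a header line.
\preformatted{
# hypothetical markers on chromosome 12 (illustrative values only)
map <- data.frame(
  id       = c("M1", "M2", "M3"),
  chr      = c(12, 12, 12),
  position = c(1000, 2500, 4000),
  anc      = c("A", "G", "T"),
  der      = c("G", "A,C", "C"),  # several derived alleles: commas, no space
  stringsAsFactors = FALSE
)
# write to a temporary file without header, row names or quotes
map_path <- tempfile(fileext = ".inp")
write.table(map, file = map_path, quote = FALSE,
            row.names = FALSE, col.names = FALSE)
}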

Haplotypes can be given either with alleles already coded as numbers (in two possible ways)
or with the actual alleles (e.g. nucleotides) which can be translated into numbers
either using the fourth and fifth column of the map file or by their alpha-numeric order.
Correspondingly, the parameter \code{allele_coding} has to be set to either \code{"12"},
\code{"01"}, \code{"map"} or \code{"none"}:
\itemize{
\item \code{"12"}: 0 represents missing values, 1 the ancestral allele
and 2 (or higher integers) derived allele(s).
\item \code{"01"}: \code{NA} or '.' (a point) represent missing values, 0 the
ancestral and 1 (or higher integers) derived allele(s).
\item \code{"map"}: for each marker, the fourth column of the map file
defines the ancestral allele and the fifth column derived alleles.
In case of multiple derived alleles, they must be separated by commas without space.
Alleles in the haplotype file which appear in neither of the two columns
of the map file are regarded as missing values (\code{NA}).
\item \code{"none"}: \code{NA} or '.' (a point) represent missing values, otherwise for each
marker the allele that comes first in alpha-numeric
order is coded by 0, the next by 1, etc. Evidently, this coding does not convey
any information about allele status as ancestral or derived, hence the alleles
cannot be regarded as polarized.
}
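
The alpha-numeric recoding described for \code{"none"} can be pictured with the
following base R sketch for a single marker. It only illustrates the rule stated
above and is not the code used internally by the package; the vector
\code{alleles} is a hypothetical example.
\preformatted{
# observed alleles at one marker; "." and NA denote missing values
alleles <- c("G", "A", ".", "G", "T", NA)
alleles[which(alleles == ".")] <- NA
# rank the distinct alleles in alpha-numeric order and start counting at 0
coded <- match(alleles, sort(unique(alleles))) - 1L
coded
# 1 0 NA 1 2 NA
}
Here "A" is coded by 0 merely because it sorts first, not because it is ancestral.
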
Information on allelic ancestry is exploited only in the frequency-bin-wise
standardization of iHS (see \code{\link{ihh2ihs}}). However, although ancestry status does
not figure in the formulas of the cross-population statistics
Rsb and XP-EHH, their values do depend on the assigned status.

The arguments \code{min_perc_geno.hap},
\code{min_perc_geno.mrk} and \code{min_maf} are evaluated in this order.
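
The order matters because each filter acts on the data left by the previous
one. The base R sketch below mimics the three steps on a small hypothetical
matrix of bi-allelic markers coded 0/1; it is an illustration under these
assumptions, not the implementation used by the package.
\preformatted{
# hypothetical data: 4 haplotypes (rows) x 3 markers (columns), NA = missing
haplo <- matrix(c(0,  1,  0,
                  1,  NA, 0,
                  NA, NA, 0,
                  1,  1,  0), nrow = 4, byrow = TRUE)
min_perc_geno.hap <- 50
min_perc_geno.mrk <- 100
min_maf <- 0

# 1. discard haplotypes genotyped at too few markers
keep_hap <- 100 * rowMeans(!is.na(haplo)) >= min_perc_geno.hap
haplo <- haplo[keep_hap, , drop = FALSE]

# 2. discard markers genotyped on too few of the remaining haplotypes
keep_mrk <- 100 * colMeans(!is.na(haplo)) >= min_perc_geno.mrk
haplo <- haplo[, keep_mrk, drop = FALSE]

# 3. discard markers with a minor allele frequency <= min_maf
freq1 <- colMeans(haplo == 1)
maf <- pmin(freq1, 1 - freq1)
haplo <- haplo[, maf > min_maf, drop = FALSE]

dim(haplo)
# 3 1
}
In this sketch the third haplotype is removed first, the second marker then
fails the marker threshold, and the monomorphic third marker is removed last.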
}
\examples{
#copy example files into the current working directory.
make.example.files()
#create object using a haplotype file in "standard format"
hap <- data2haplohh(hap_file = "bta12_cgu.hap",
                   map_file = "map.inp",
                   chr.name = 12,
                   allele_coding = "map")
#create object using fastPHASE output
hap <- data2haplohh(hap_file = "bta12_hapguess_switch.out",
                   map_file = "map.inp",
                   chr.name = 12,
                   popsel = 7,
                   allele_coding = "map")
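#create object using a haplotype file in "transposed format", which has to
#be set explicitly; this sketch assumes that make.example.files() also
#provides a file "bta12_cgu.thap" and is skipped if that file is not present
if (file.exists("bta12_cgu.thap")) {
  hap <- data2haplohh(hap_file = "bta12_cgu.thap",
                     map_file = "map.inp",
                     chr.name = 12,
                     allele_coding = "map",
                     haplotype.in.columns = TRUE)
}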
#clean up demo files
remove.example.files()
}
\references{
Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype
data: applications to inferring missing genotypes and haplotypic phase. \emph{Am J Hum Genet}, \strong{78}, 629-644.

O'Connell J, Gurdasani D, Delaneau O, et al (2014) A general approach for haplotype phasing
across the full spectrum of relatedness. \emph{PLoS Genet}, \strong{10}, e1004234.
}
