\name{MS.DataCreationCDF}
\alias{MS.DataCreationCDF}

\title{
Same as \code{MS.DataCreation}, but able to read AIA/ANDI NetCDF, mzXML, mzData and mzML files.
Creates an initial data matrix from GC-MS analyses by collecting and assembling the information from chromatograms and mass spectra stored in AIA/ANDI NetCDF, mzXML, mzData and mzML files.
Warning: the xcms package is required. }
\description{
This function constructs an initial data matrix by collecting and assembling the information from chromatograms and mass spectra from several GC-MS analyses. For all input files, peak retention times (or retention indices) are retrieved from the chromatograms (from the rteres.txt file) and associated with their respective mass spectra (from the CDF file). Each row of the output data matrix represents one peak in one analysis and gives the sample name in the first column, the peak retention time (or retention index) in the second column and the mass spectrum of the peak in the following columns. If the input file is in Agilent format, quantification information can be added by reporting the percent of the total corrected area and the corrected area.
}
\usage{
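## The xcms R package is required.
## To install xcms, copy and paste the following line and remove the comment signs (#):
## source("http://bioconductor.org/biocLite.R"); biocLite("xcms")
## On current Bioconductor releases, use instead: BiocManager::install("xcms")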

MS.DataCreationCDF(path, pathCDF="", mz, apex, quant = FALSE)
}

\arguments{
  \item{path}{
Name of the folder containing all the GC-MS analyses (i.e. the Agilent .D folders, each with at least the rteres.txt file).
}
  \item{pathCDF}{
Name of the folder containing all the CDF files.

You can give the path explicitly (e.g. \code{pathCDF="c:/Myfolder/"}) or leave this value empty. By default \code{pathCDF=""}.
If \code{pathCDF=""}, the function requires the \code{tcltk} R package to be installed, and an interactive window will help you browse your computer for the folder containing all the CDF files.
} 
  \item{mz}{
Range of mass fragments delimiting the mass spectrum, e.g. \code{30:250}.
}
  \item{apex}{
\code{TRUE} indicates that the mass spectrum is taken at the apex of the peak. \code{FALSE} indicates that a mean mass spectrum is computed: for Agilent files, by averaging 5 percent of the mass spectra surrounding the apex (apex included); for ASCII files, by averaging the mass spectrum before the apex, the mass spectrum at the apex and the mass spectrum after the apex.
}
  \item{quant}{
The option \code{quant} indicates whether quantification information should be extracted from rteres.txt and added to the initial data matrix. \code{TRUE} indicates that the two quantification columns corr.area (corrected peak area) and \% of total (percent of the total corrected area) are extracted from rteres.txt and added to the initial data matrix after the retention time (or retention index) column. Corrected area is used for absolute quantification when combined with external and/or internal standards. Percent of the total corrected area is used for relative quantification (no external or internal standard needed). This choice makes it possible to generate a profiling matrix with quantification of each molecule after \code{MS.clust}. \code{FALSE} indicates that the quantification information should not be added to the initial data matrix. In that case, a fingerprinting matrix (absence or presence of each molecule) is obtained after \code{MS.clust}.
}
}
\details{
After a GC-MS analysis with an Agilent instrument, a .D folder is created that contains different files from the chromatograph and from the mass spectrometer. The input files in the sample folder can be of different origins: 
	 
	(i) For Agilent Technologies instruments (using the default parameters): each analysis returns a .D folder that contains a file rteres.txt with summary information on the chromatogram. A second file (with the mass spectra information) is needed and can be generated by the user with the Chemstation data analysis software (Menu/File/AIA ANDII...). By default the generated file is in *.CDF format and is placed in a user-defined folder.   
	  
				The function first checks that all sample folders (.D) within the folder \emph{path} contain a file rteres.txt. If one file is missing, the analysis stops and reports the name of the problematic sample. The analysis should be restarted after the sample is corrected or removed. Next, the function obtains the path to the folder with the AIA/ANDI NetCDF, mzXML, mzData or mzML files (through a prompt window when \code{pathCDF=""}), then collects the peak retention times (or retention indices) from rteres.txt and looks for the corresponding mass spectra in the AIA/ANDI NetCDF, mzXML, mzData or mzML files of this second directory. Depending on the \code{apex} option, either the mean mass spectrum of each peak is calculated or the mass spectrum at the apex is extracted. The intensity, in counts, of each mass fragment is transformed to a relative percentage of the highest mass fragment in the spectrum. If \code{quant = TRUE}, the two quantification columns CorrArea (corrected peak area) and PercTot (percent of the total corrected area) are extracted for each peak from rteres.txt and placed in columns 3 and 4, respectively, of the output data matrix.  
				 
	(ii) For other providers: data should be transformed using the \code{MS.DataCreation} function.
			 
	During the analysis, a temporary file called save_list_temp.rda is automatically generated in folder \emph{path}. 
		 
	The final output file, called initial_DATA.txt, is saved in folder \emph{Output_MSDataCreation_resultdate_time}.  
	The output data matrix contains the relative mass spectrum of each peak of all samples. The first column contains the sample name (the name of the folder containing the GC-MS analysis), the second column contains the peak retention time (or retention index) and the following columns contain the relative mass spectrum of the peak (within the range of the mass spectrum).  

If \code{quant = TRUE} (Agilent files), the first column contains the sample name, the second column contains the peak retention time (or retention index), the third column contains the corrected area (CorrArea), the fourth column contains the percent of the total corrected area (PercTot) and the following columns contain the relative mass spectrum of the peak (within the range of the mass spectrum). 
}
\value{
\code{MS.DataCreationCDF} returns a data matrix as an R object. This data matrix is also saved as initial_DATA.txt in folder \emph{Output_MSDataCreation_resultdate_time}. It contains one row per peak and per individual, with columns for the sample name, the retention time (or retention index) and the relative mass spectrum. If \code{quant = TRUE} (Agilent files), two supplementary columns, CorrArea and PercTot, are added after the retention time column. 
A temporary list is generated during the process. It allows intermediate results to be recovered if the function stops before completion because of an error.

}

\author{
Elodie Courtois, Yann Guitton, Florence Nicole
}


\examples{
\dontrun{ 
## Not run
## Requires the xcms package
## For Agilent GC-MS files, you have to create 2 folders:
## one folder with all the rteres.txt files placed in various .D sub-folders
## and one folder with all the CDF or XML files.
## The CDF files have to be downloaded from the MSeasy web site:
##  http://sites.google.com/site/rpackagemseasy/downloads/ExempleCDF.zip


## url1<-"http://sites.google.com/site/rpackagemseasy/downloads/ExempleCDF.zip"
## download.file(url=url1, destfile="AgilentCDF.zip")
## unzip(zipfile="AgilentCDF.zip", exdir=".") 
## A folder is created in your current working directory
## unlink("AgilentCDF.zip")  ## delete the zip file

pathAgilent<-system.file("doc/Agilent_MSDataCreation", package="MSeasy")

# With pathCDF
MS.DataCreationCDF(path=pathAgilent, pathCDF=getwd(), mz=30:250, apex=FALSE) 

# Without pathCDF
MS.DataCreationCDF(path=pathAgilent, mz=30:250, apex=FALSE) 


## A dialog box appears and asks for the path to the ExempleCDF folder
## downloaded and unzipped from the MSeasy web site


  }
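
## Illustrative sketch only (base R, not the MSeasy implementation):
## how raw intensities can be rescaled to a relative percentage of the
## highest mass fragment in a spectrum, as described in Details.
## The vector 'counts' and its m/z labels are made-up values for a
## hypothetical spectrum; they do not come from any real analysis.
counts <- c(120, 4800, 960, 0, 2400)   # raw intensities, in counts
names(counts) <- 30:34                 # hypothetical mass fragments (m/z)
rel <- 100 * counts / max(counts)      # highest fragment becomes 100
rel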
}


