\name{g1_part}
\alias{g1_part}
\docType{data}
\title{A part of g1.csv}
\description{
 \code{g1.csv}, in the folder \code{data_top100}, stores the counts and sequences of the top 100 genes in the Grimmond EB data.
 (As of Version 1.2, \code{data_top100} has been moved to \cr
 \code{http://www.stanford.edu/~junli07/research.html} to reduce the total size of this package.)
 This data set contains only a part of \code{g1.csv}: the top 10 genes.
 Only a small part is kept in order to shorten the running time of the example code.
 The full top 100 genes from each data set are provided as separate files in the folder \code{data_top100}. Please read \code{Readme_format.txt} for details about the data and the required format.
 This data can be generated by \cr \code{
 g1 <- read.csv("g1.csv");
 g1_part <- g1[g1$index < 11,]
}
}
\usage{data(g1_part)}
\format{
  A data frame with 8307 observations on the following 4 variables.
  \describe{
    \item{\code{index}}{a numeric vector}
    \item{\code{tag}}{a numeric vector}
    \item{\code{seq}}{a factor with levels \code{T} \code{A} \code{C} \code{G}}
    \item{\code{count}}{a numeric vector}
  }
}
\details{
 \code{index} is the index of the gene from which this count comes.\cr
 \code{tag} is an integer value: \code{0} means that this count is used, and any other value means that this count is not taken into account. In our files, \code{-2} marks the UTR part and \code{-1} marks the further 100 bp. The user can use any integer other than \code{0} to denote discarded counts.\cr
 \code{seq} is the nucleotide at this position. It must be an uppercase \code{A}, \code{C}, \code{G}, or \code{T}. No other characters, lowercase letters, or missing values are accepted. If the number of missing values is small, you can substitute \code{T} (or \code{A}, \code{C}, or \code{G}) for them; this should not change the result significantly.\cr
 \code{count} is the count of reads starting at this position.\cr\cr
 For each gene (or each group of positions that have the same level of expression, such as an exon or isoform), a distinct index should be used. Each gene (or group) may include positions on both strands (as in data generated by Illumina) or on a single strand (as in data generated by ABi).
 Within each gene (or group), the positions should be in 5 prime to 3 prime order for each strand. There should be no gaps or missing values.
 Consequently, for each gene in Illumina output, the data consist of two halves: the first half is the data from the forward strand, and the second half is the data from the reverse strand.
 For each gene in ABi output, there are no such two halves.\cr\cr
 For each gene or each half, the nucleotides retained for analysis should be flanked by sufficiently long stretches of non-retained nucleotides.
 For example, if you want to use the 40 bp to the left and the 40 bp to the right as surrounding sequences, then there should be at least 40 bp on both sides of the retained nucleotides.\cr\cr
 The correct format is very important; otherwise, the program may give unpredictable results.
 This package does not itself verify that the format is correct. Please make sure you have checked it.
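 \cr\cr
 Since the format is not checked by the package, a quick check before analysis may help. The following is only a minimal sketch, not part of the package: \code{toy} is a hypothetical data frame with made-up values in the four-column layout described above, and the checks cover only the column names, missing values, integer-valued \code{tag}, uppercase nucleotides, and non-negative counts. It does not check strand order or the length of the flanking sequences.
 \preformatted{
# Hypothetical toy data in the four-column layout (values are made up).
toy <- data.frame(
  index = rep(1:2, each = 4),
  tag   = c(-1, 0, 0, -1, -2, 0, 0, -1),
  seq   = factor(c("A", "C", "G", "T", "T", "G", "C", "A"),
                 levels = c("T", "A", "C", "G")),
  count = c(0, 3, 1, 0, 2, 5, 0, 1)
)

# Basic format checks; stopifnot() raises an error at the first failure.
stopifnot(
  all(is.element(c("index", "tag", "seq", "count"), names(toy))),
  !any(is.na(toy)),                     # no missing values
  all(toy$tag == round(toy$tag)),       # tag is integer-valued
  all(is.element(as.character(toy$seq), c("A", "C", "G", "T"))),
  all(toy$count >= 0)                   # read counts cannot be negative
)

# Number of retained positions (tag equal to 0) per gene index.
retained <- table(toy$index[toy$tag == 0])
print(retained)
 }
 To apply the same checks to this data set, load it with \code{data(g1_part)} and replace \code{toy} with \code{g1_part}.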
}
\references{
 Li J, Jiang H, Wong WH, Modeling non-uniformity in short-read rates in RNA-Seq data, submitted.
}
\keyword{datasets}
