Instructions regarding
the reading of raw data files
Ovarian 8-7-02 Dataset
- Download from http://ncifdaproteomics.com/ppatterns.php
(follow the link for low resolution data) and unzip into a directory --
call it c:\ciphergen. I put all the csv files (control and
patient) in the same directory. The ciphergen designation
indicates these data were produced by a Ciphergen instrument -- the
qstar data is produced by a different instrument.
- Make sure the names have a common format such as
Control daf-0num.csv for the controls and Ovarian Cancer daf-0num.csv for
the patients -- a few changes were necessary when I did this.
- Use the R code
with appropriate modifications if necessary. R treats "\" as an
escape character inside strings, so it cannot be used on its own as a
directory separator. A file such as "c:\ciphergen\Control
daf-0201.csv" should be written as "c:/ciphergen/Control daf-0201.csv" or
"c:\\ciphergen\\Control daf-0201.csv" in R.
- This produces wvec.rda, cancer.rda, and
control.rda. The .rda indicates an R data file. The last
two files should be self-explanatory. wvec.rda is a vector
indicating the common m/z
values for the spectra. See the code for other remarks.
Qstar, high resolution ovarian data
- Download and unzip into a directory --
call it c:\qstar. As before I put all the csv files (control and
patient) in
the same directory.
- The data here need to be aggregated -- see the
NCI-FDA website for a description of how the length of the raw data
vectors varies among samples. Their description of what was done is
not very clear. They say, "It is important to understand
that besides analyzing the raw data, we
also bin our data at a 400 ppm bin rate that reduces the dimensionality
of the data from over 300,000 data points per spectra to fewer than
8000." In the Conrads article
(Endocrine-Related Cancer Journal citation) this seems to mean
the bin size = 400/1000000 * m/z,
so that the bin size at 700 m/z =
.28 and the bin size at 12000 should be 4.80 (rather than the 4.75 they
indicate in the paper). When I use these bin sizes I get 7106
bins rather than the 7086 cited in Conrads. Here wvec.rda
corresponds to the right hand endpoints of these bins so that wvec[1] =
700.28 and wvec[7106] = 12003. With this coding the right hand
endpoints are within .01 m/z
of all the m/z values they
report in the models on the website. Other binning schemes I tried
did not reproduce these values closely and were therefore not
used. When I asked for clarification about this at the website, no
one responded.
- The R code
establishes these 7106 bins and aggregates the raw data into
them. Because the raw files are so large, a FORTRAN subroutine (see the
file binspot.f) was used to speed up this
process. Here is a bare bones
description of how to use this and other FORTRAN procedures developed
for
this paper.
- One point that may be noteworthy:
a spectrum's value in a bin is the
average of all raw values that fell into the bin range, not the
sum of the raw values within the bin.
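The binning and averaging described above can be sketched as follows. This is
an illustration in Python, not the R/FORTRAN code referred to in these notes
(binspot.f is not reproduced here). It assumes that each bin's width is 400 ppm
of its own left-hand endpoint, starting at 700 m/z, and that a raw point belongs
to a bin when its m/z lies in (left, right]; the function names and the
treatment of the endpoints are hypothetical. Under that reading the sketch
yields 7106 bins whose first right-hand endpoint is 700.28 and whose last is
close to 12003, in line with the figures quoted above.

```python
from bisect import bisect_left
from math import nan


def ppm_bin_edges(start=700.0, stop=12000.0, ppm=400.0):
    """Right-hand endpoints of bins whose width is `ppm` parts per million
    of each bin's left-hand endpoint (an assumed reading of the scheme)."""
    edges = []
    left = start
    while left < stop:
        right = left * (1.0 + ppm / 1e6)  # bin size = ppm/1e6 * m/z
        edges.append(right)
        left = right
    return edges


def bin_means(mz, intensity, edges, start=700.0):
    """Average (not sum) the raw intensities falling in each bin (left, right].
    Bins that receive no raw points are reported as NaN."""
    sums = [0.0] * len(edges)
    counts = [0] * len(edges)
    for m, y in zip(mz, intensity):
        if m <= start:
            continue  # below the first bin (assumption: left end excluded)
        i = bisect_left(edges, m)  # index of the first right endpoint >= m
        if i == len(edges):
            continue  # beyond the last bin
        sums[i] += y
        counts[i] += 1
    return [s / c if c else nan for s, c in zip(sums, counts)]


wvec = ppm_bin_edges()
print(len(wvec), round(wvec[0], 2), round(wvec[-1], 2))
```

A pure-Python loop like this would be slow on raw files of several hundred
thousand points per spectrum, which is presumably why a compiled subroutine was
used for the real data.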
Normalization of the two datasets was performed
differently. For the Ciphergen data the spectra were rescaled
linearly so the smallest value was 0 and the largest was 1. This
was described in an earlier document on the NCI-FDA website (since
removed) and was the approach described in Baggerly et al. 2004.
This has no effect on the genetic algorithm since all the rescaling is
done within an individual on a chromosome by chromosome
basis. It may have some effect (relative to performing no
normalization) on the boosting and pam algorithms, but it is likely to be
quite small since the max for all spectra was 100 (except one which
reported a max value of 99.75) and the minima lay between 3.75 and 3.95
-- so nearly the same linear transformation was applied to
each spectrum.
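A minimal Python sketch of this 0-to-1 rescaling is given here for
illustration. The function name and the sample values are hypothetical and are
not taken from the data files. With a maximum of 100 and minima between 3.75
and 3.95, the slope 1/(max - min) changes by only about 0.2% from one spectrum
to another, which is why the transformation is nearly the same for all of them.

```python
def rescale_unit(spectrum):
    """Linearly rescale a spectrum so its smallest value is 0 and its largest is 1."""
    lo, hi = min(spectrum), max(spectrum)
    if hi == lo:
        raise ValueError("a constant spectrum cannot be rescaled")
    return [(y - lo) / (hi - lo) for y in spectrum]


# Made-up intensities spanning the range of minima quoted in the text.
a = rescale_unit([3.75, 51.875, 100.0])
b = rescale_unit([3.95, 51.975, 100.0])

# Ratio of the two slopes, 1/(100 - 3.95) versus 1/(100 - 3.75).
slope_ratio = (100.0 - 3.75) / (100.0 - 3.95)
print(a, b, slope_ratio)
```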
For the qstar data, once the raw values were aggregated they were
normalized to have the same average value across the whole
sample. Some adjustment seemed necessary here because
the magnitudes for samples processed later (predominantly
cancer) were generally lower than those processed earlier -- see the QC
document on the NCI-FDA website. Again, the normalization has no
effect for the genetic algorithm. For the other algorithms it
seemed important to try to address this temporal effect.
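One way to read this normalization is that every binned spectrum is multiplied
by a constant so that all spectra share the same mean intensity. The Python
sketch below illustrates that reading only. The common value (target = 1.0) is
an arbitrary assumption, since the notes do not say which value was used, and
the function name is hypothetical.

```python
def normalize_to_common_mean(spectra, target=1.0):
    """Scale each spectrum so its mean intensity equals `target` (assumed value)."""
    normalized = []
    for s in spectra:
        mean = sum(s) / len(s)
        if mean == 0:
            raise ValueError("a spectrum with zero mean cannot be normalized")
        normalized.append([y * target / mean for y in s])
    return normalized


# Two made-up spectra with the same shape but different overall magnitude,
# mimicking an earlier (stronger) and a later (weaker) sample.
early = [2.0, 4.0, 6.0]
late = [1.0, 2.0, 3.0]
norm = normalize_to_common_mean([early, late])
print(norm)
```

After scaling, spectra that differ only in overall magnitude coincide, which is
the kind of run-order drift this step is meant to reduce.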