Instructions regarding the reading of raw data files

Ovarian 8-7-02 Dataset

  1.  Download from http://ncifdaproteomics.com/ppatterns.php (follow the link for low resolution data) and unzip into a directory -- call it c:\ciphergen.  I put all the csv files (control and patient) in the same directory.  The ciphergen designation indicates these data were produced by a Ciphergen instrument -- the qstar data were produced by a different instrument.
  2. Make sure the names have a common format such as Control daf-0num.csv for the controls and Ovarian Cancer daf-0num.csv for the patients  -- a few changes were necessary when I did this.
  3. Use the R code, modifying it as necessary.  Because of its UNIX origins, R treats "\" as an escape character inside strings rather than as a directory separator.  A file such as c:\ciphergen\Control daf-0201.csv should therefore be written in R as "c:/ciphergen/Control daf-0201.csv" or "c:\\ciphergen\\Control daf-0201.csv".
  4. Running the code produces wvec.rda, cancer.rda, and control.rda.  The .rda extension indicates an R data file.  The last two files should be self-explanatory; wvec.rda is a vector of the common m/z values for the spectra.  See the code for other remarks.
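
The steps above can be sketched in R roughly as follows.  This is a minimal illustration, not the original code: the function name read_spectra is hypothetical, and it assumes each csv file has a header row with m/z in the first column and intensity in the second, and that all files share the same m/z values.

```r
# Sketch only: read every spectrum whose file name starts with `prefix`.
# Assumes a header row, m/z in column 1 and intensity in column 2.
read_spectra <- function(dir, prefix) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  files <- files[startsWith(basename(files), prefix)]
  stopifnot(length(files) > 0)
  columns <- lapply(files, function(f) read.csv(f)[[2]])
  spectra <- do.call(cbind, columns)   # one column per spectrum
  colnames(spectra) <- basename(files)
  spectra
}

# Hypothetical usage; forward slashes avoid the "\" escaping problem:
# data_dir <- "c:/ciphergen"            # same directory as "c:\\ciphergen"
# control  <- read_spectra(data_dir, "Control")
# cancer   <- read_spectra(data_dir, "Ovarian Cancer")
# wvec     <- read.csv(file.path(data_dir, "Control daf-0201.csv"))[[1]]
# save(control, file = file.path(data_dir, "control.rda"))
```

Selecting files by a name prefix is the reason the common naming format in step 2 matters.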

Qstar, high resolution ovarian data

  1. Download and unzip into a directory -- call it c:\qstar.  As before I put all the csv files (control and patient) in the same directory.
  2. The data here need to be aggregated -- see the NCI-FDA website for a description of how the lengths of the raw data vectors vary among samples.  Their description of what was done is not very clear.  They say, "It is important to understand that besides analyzing the raw data, we also bin our data at a 400 ppm bin rate that reduces the dimensionality of the data from over 300,000 data points per spectra to fewer than 8000." In the Conrads article (Endocrine-Related Cancer Journal  citation) this seems to mean that the bin size = 400/1000000 * m/z, so that the bin size at 700 m/z is .28 and the bin size at 12000 m/z should be 4.80 (rather than the 4.75 they indicate in the paper).  With these bin sizes I get 7106 bins rather than the 7086 cited in Conrads.  Here wvec.rda holds the right hand endpoints of these bins, so that wvec[1] = 700.28 and wvec[7106] = 12003.  With this coding the right hand endpoints are within .01 m/z of all the m/z values they indicate in the models on the website.  Other binning attempts did not reproduce these values closely and were therefore not used.  When I asked for clarification about this at the website no one responded.
  3. The R code establishes these 7106 bins and aggregates the raw data into them.  Because the raw files are so large, a FORTRAN subroutine (see the file binspot.f) was used to speed up this process.  Here is a bare bones description of how to use this and other FORTRAN procedures developed for this paper.
  4. One point that may be noteworthy: a spectrum's value in a bin is the average of all raw values that fell into the bin range, not the sum of those values.
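
The binning in steps 2 to 4 can be sketched in plain R as follows.  This is an illustration under stated assumptions, not the original R code or binspot.f: the names make_bins and bin_spectrum are hypothetical, and the bins are assumed to grow geometrically, each right hand endpoint being 1.0004 times the previous one, starting from 700 m/z.  Under that assumption the sketch gives 7106 bins with first endpoint 700.28 and last endpoint about 12003, which agrees with the figures in step 2.

```r
# Assumption: each right hand endpoint is (1 + 400e-6) times the previous
# one, starting from 700 m/z, until 12000 m/z is covered.
make_bins <- function(lo = 700, hi = 12000, ppm = 400) {
  step <- 1 + ppm / 1e6
  n_bins <- ceiling(log(hi / lo) / log(step))
  lo * step^seq_len(n_bins)            # right hand endpoints
}

# Average (not sum) the raw intensities that fall into each bin.
# Bins are (left, right], except that the first bin also includes `lo`.
bin_spectrum <- function(mz, intensity, wvec, lo = 700) {
  idx <- cut(mz, breaks = c(lo, wvec), labels = FALSE, include.lowest = TRUE)
  ok <- !is.na(idx)                    # drop points outside [lo, max(wvec)]
  bins <- factor(idx[ok], levels = seq_along(wvec))
  as.numeric(tapply(intensity[ok], bins, mean))   # NA marks an empty bin
}

wvec <- make_bins()
```

A vectorized R version like this is simple but may be slow on files with over 300,000 points per spectrum, which is presumably why a compiled routine was used.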

Normalization of the two datasets was performed differently.  For the Ciphergen data the spectra were rescaled linearly so the smallest value was 0 and the largest was 1.  This was described in an earlier document on the NCI-FDA website (since removed) and was the approach described in Baggerly et al. 2004.  This has no effect on the genetic algorithm since all the rescaling is done within an individual on a chromosome by chromosome basis.  It may have some effect (relative to performing no normalization) on the boosting and pam algorithms, but the effect is likely to be quite small: the max for all spectra was 100 (except one which reported a max value of 99.75) and the minima lay between 3.75 and 3.95, so the rescaling was nearly the same linear transformation for every spectrum.
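
A minimal sketch of this rescaling, assuming each spectrum is a numeric vector (the name rescale01 is hypothetical and is not taken from the original code):

```r
# Rescale one spectrum linearly so that its minimum is 0 and its maximum is 1.
rescale01 <- function(x) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical usage, assuming one spectrum per column of a matrix:
# control01 <- apply(control, 2, rescale01)
```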

For the qstar data, once the raw values were aggregated they were normalized to have the same average value across the whole sample.  Here it seemed necessary to do something to address the fact that the magnitudes for samples processed later (predominantly cancer) were generally smaller than those for samples processed earlier -- see the QC document on the NCI-FDA website.  Again, the normalization has no effect for the genetic algorithm.  For the other algorithms it seemed important to try to address this temporal effect.
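
A minimal sketch of such a normalization, assuming the binned spectra are stored one per column of a matrix.  The name normalize_mean and the common target value of 1 are assumptions; the target actually used is not stated here.

```r
# Scale each spectrum (column) so that its mean over all bins equals `target`.
# Empty bins (NA) are ignored when computing the mean.
normalize_mean <- function(spectra, target = 1) {
  col_means <- colMeans(spectra, na.rm = TRUE)
  sweep(spectra, 2, col_means, "/") * target
}
```

Dividing by each spectrum's own mean removes overall differences in magnitude between samples, such as the drift over processing time, while leaving the shape of each spectrum unchanged.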