Discussion of issues related to choice of calibrants, mi, and pi
In checking the 2 spectra from each day provided on the author's web site (which is very nicely organized and helpful), I noted from the xml files that while the scans may have been optimized for the range from 2000 to 20000 Daltons (in the spotProtocolInstructions field), the calibration was not suited for this range. This is because the calibrants used (as supplied in massCalibrationInfo) are at m/z values of 12360.2+H, 16951.5+H, 35688+H, 66433+H, and 116351+H, so that the location of m/z 5300 or so shown in figure 2 must be found by extrapolation as opposed to interpolation. By contrast, the peaks that the author uses for calibrating the spectra (taken from peaks-quadratic.csv) have actual values of 2169.8, 5363.4, 7782.9, 9298.7, and 13866.9, nicely bracketing the region of interest. I strongly suspect that the Ciphergen plots would look much better if these peaks were used.
This is the case, and it is an important point. On page 7, column 1, I included a discussion of the importance of choosing calibrants that cover and bracket the range of interest. In our case we were inexperienced and were guided by a Ciphergen representative in our choice of calibrants. In retrospect we should probably have chosen lower-range peptides.
We also obtained data (not presented) from the same chips and spots using a high laser setting; in this case the calibrants we selected were more reasonable. Had we used just low-range peptides for calibration we would likely have obtained poor results for this high range. The instrument operator says she thinks it would be difficult to calibrate the machine for low and high ranges simultaneously; her impression is that the machine may be relatively precise in one or the other but not both. One way of testing this is to mix low-range peptide calibrants with high-range protein calibrants (it is my understanding that Ciphergen sells them separately). We plan to try this mixture in the future.
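The interpolation versus extrapolation point can be checked mechanically. The following is a minimal sketch, not part of the analysis in the paper: the function names are hypothetical, and the masses are the calibrant and alignment-peak values quoted in the reviewer's comment above.

```python
# Calibrant masses quoted from massCalibrationInfo (instrument calibration)
INSTRUMENT_CALIBRANTS = [12360.2, 16951.5, 35688.0, 66433.0, 116351.0]
# Peak masses quoted from peaks-quadratic.csv (used for alignment)
ALIGNMENT_PEAKS = [2169.8, 5363.4, 7782.9, 9298.7, 13866.9]


def is_bracketed(target, calibrants):
    """True if the target mass lies between the lowest and highest calibrant."""
    return min(calibrants) <= target <= max(calibrants)


def extrapolation_distance(target, calibrants):
    """Distance from the target to the calibrated range (0 if bracketed)."""
    if is_bracketed(target, calibrants):
        return 0.0
    return min(abs(target - min(calibrants)), abs(target - max(calibrants)))


if __name__ == "__main__":
    target = 5300.0  # approximate m/z of the peak discussed for figure 2
    for name, masses in [("instrument", INSTRUMENT_CALIBRANTS),
                         ("alignment", ALIGNMENT_PEAKS)]:
        print(name, is_bracketed(target, masses),
              extrapolation_distance(target, masses))
```

With these values the m/z 5300 peak falls about 7000 Daltons below the lowest instrument calibrant, whereas it sits inside the range spanned by the alignment peaks.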
This (i.e. the effect of choice of calibrants) is a major issue, and the author should compare the behavior of his algorithms with the Ciphergen software when they are using the same set of calibrating peaks. I would suggest that this be done with both sets discussed above, (a) to see how good things are when the peaks bracket the target region, and (b) to see how bad things get when they are far away.
I attempted to address these questions in the following way. The spectra in the paper provided data over the 0 to 100,000 Dalton range. In tables 2 and 3 of the paper I compared unadjusted spectra (calibrated using high range calibrants) with these same spectra aligned using low-range mi and pi. Below are the c.v.s for peaks obtained in the 2,000-20,000 range. The results are all obtained using the PROcess preprocessing and peak-picking tools and reproduce what is in the paper (page 7 column 1).
Distribution of c.v.s for 2-20 KD
| Method | 25% | 50% | Mean | 75% | Number of Peaks |
|---|---|---|---|---|---|
| Unadjusted | 37 | 44 | 63 | 61 | 89 |
| Cubic spline | 19 | 25 | 26 | 31 | 86 |
| Quadratic (ciphergen) | 19 | 25 | 26 | 31 | 86 |
The quadratic results come from adjusting the Ciphergen XML files, exporting the data as CSV files and then using PROcess – as discussed in the text, the results are the same for the cubic spline and quadratic approaches.
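For readers who want to reproduce the layout of the summary rows, here is a minimal sketch of how such a row could be computed. It assumes that each peak's c.v. is the sample standard deviation of its intensity across spectra divided by the mean, expressed as a percentage; the definition used in the paper may differ, and the function names are hypothetical. It is not the PROcess code.

```python
from statistics import mean, quantiles, stdev


def cv_percent(intensities):
    """Coefficient of variation (%) of one peak across spectra (assumed definition)."""
    return 100.0 * stdev(intensities) / mean(intensities)


def summarize_cvs(cvs):
    """Quartiles, mean, and count of per-peak c.v.s, matching the table columns."""
    q25, q50, q75 = quantiles(cvs, n=4, method="inclusive")
    return {"25%": q25, "50%": q50, "Mean": mean(cvs), "75%": q75,
            "Number of Peaks": len(cvs)}


if __name__ == "__main__":
    # Made-up intensities for three peaks, each measured in four spectra
    peaks = [[90, 100, 110, 100], [40, 60, 50, 50], [10, 30, 20, 20]]
    print(summarize_cvs([cv_percent(p) for p in peaks]))
```

Quartile conventions vary between packages, so small differences from values computed in R are to be expected.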
Using the same spectra I then chose mi and pi values in the high range, near the range of the calibrants. The approximate locations of the pi (they changed a bit from spectrum to spectrum) were 11.7 KD, 14.0 KD, 33.2 KD, 66.4 KD, and 79.1 KD, relatively close to the calibrants chosen for the instrument (12.4 KD, 17.0 KD, 35.7 KD, 66.4 KD, and 116.4 KD) with the exception of the last calibrant, which lies outside the data range. Two versions of the quadratic method were run, one with b=0 and one with nonzero b.
The table below shows the results for peaks in the low range, i.e. all the methods are calibrated/aligned far from the region of interest.
Distribution of c.v.s for 2-20 KD
| Method | 25% | 50% | Mean | 75% | Number of Peaks |
|---|---|---|---|---|---|
| Unadjusted | 37 | 44 | 63 | 61 | 89 |
| Cubic spline | 29 | 36 | 41 | 43 | 84 |
| Quadratic (b nonzero) | 39 | 73 | 88 | 113 | 88 |
| Quadratic (b = 0) | 25 | 31 | 33 | 38 | 80 |
The table shows that the cubic spline is still a clear improvement over the unadjusted data, but the quadratic method with nonzero b is terrible, even worse than the unadjusted results. Visual inspection of the spectra revealed that the extrapolation errors were quite substantial. I then processed the data constraining b=0 and the improvement was remarkable, surpassing even the cubic spline (which has linear as opposed to quadratic extrapolation). This supports the Ciphergen rep’s suggestion that overfitting could occur with nonzero b values if extrapolation is an issue, and it strengthens the concerns about extrapolating results outside the calibration/alignment range.
I then examined how these same alignments had adjusted the higher peaks, those near the calibrants and the high-range mi and pi. The table below shows the c.v.s in the 10-100 KD range.
Distribution of c.v.s for 10-100 KD
| Method | 25% | 50% | Mean | 75% | Number of Peaks |
|---|---|---|---|---|---|
| Unadjusted | 32 | 35 | 36 | 42 | 31 |
| Cubic spline | 29 | 33 | 34 | 39 | 36 |
| Quadratic (b nonzero) | 30 | 34 | 35 | 40 | 33 |
| Quadratic (b = 0) | 29 | 33 | 34 | 38 | 30 |
All the methods perform comparably in this range, and none differs much from the unadjusted data. This supports the reviewer’s suggestion that, had the calibration been appropriate for the low range, the results in Tables 2 and 3 would probably have been similar for the unadjusted, cubic spline, and quadratic methods.
The lessons I draw from this exercise are:
1) It is critical that calibration/alignment be performed over the range of interest. If two ranges are of interest then I suspect some type of alignment procedure will be necessary for at least one of the ranges.
2) Fitting the b parameter likely provides at best only modest improvements and may create additional problems if the calibration/alignment region is inappropriate. This has led to an option of fitting the model in equation 1 with b=0; the webpage discussing the Ciphergen algorithm has R code implementing this alternative.
3) While it may be true that good calibration reduces the need for a separate alignment procedure when data are processed on a single machine within one lab over a short period of time, there will still likely be a need for such algorithms when data are compared across machines, centers, or long periods of time.
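To illustrate the b=0 option mentioned in lesson 2, the following is a minimal sketch and not the R code referred to above. It assumes the calibration model has the quadratic time-of-flight form m = a(t - t0)^2 + b; equation 1 is not reproduced in this response, so that form, the function names, and the example numbers are all assumptions. With b fixed at 0, taking square roots turns the model into a straight line in t, which can be fitted by ordinary least squares.

```python
import math


def fit_quadratic_b0(times, masses):
    """Fit m = a * (t - t0)**2 (b fixed at 0) by least squares on sqrt(m).

    sqrt(m) = s * t + c, with s = sqrt(a) and c = -s * t0, is a straight
    line in t, so the two parameters follow from a simple linear fit.
    Returns the pair (a, t0).
    """
    roots = [math.sqrt(m) for m in masses]
    n = len(times)
    t_bar = sum(times) / n
    r_bar = sum(roots) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (r - r_bar) for t, r in zip(times, roots))
    slope = sxy / sxx
    intercept = r_bar - slope * t_bar
    return slope ** 2, -intercept / slope


def predict_mass(t, a, t0):
    """Mass predicted by the b=0 model at flight time t."""
    return a * (t - t0) ** 2


if __name__ == "__main__":
    # Synthetic reference points generated from a = 0.25, t0 = 2 (illustrative only)
    a, t0 = fit_quadratic_b0([10.0, 20.0, 30.0], [16.0, 81.0, 196.0])
    print(a, t0, predict_mass(40.0, a, t0))
```

Note that this sketch minimizes error on the square-root scale rather than on the mass scale, which is a simplification. With only two free parameters the fitted curve has less freedom to bend outside the range of the reference peaks, which is consistent with the better extrapolation behavior seen in the tables for b=0.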