Selection of Peak Masses (p_i) and Corresponding Targets (m_i)
Overview of approach
For each spectrum requiring realignment, both algorithms need a
relatively small set of peaks with m/z values p_1, ..., p_N and
associated target m/z values m_1, ..., m_N. Here is one approach for
generating these corresponding lists in an automated fashion.
This approach begins with a reference spectrum -- a fixed spectrum to
which all other spectra will be compared and aligned. This may be
obtained by choosing one of the spectra or creating a spectrum that is
the average of all the spectra. An example of a reference
spectrum is here. A spectrum that requires
alignment to
this reference will be referred to as a test spectrum (example). To begin,
first define a subrange of the mass values -- in this case perhaps
2,000 to 3,000 Daltons. Within this range locate the largest peak in the
test spectrum and denote its location as p_1. Then consider all the
peaks in the reference spectrum that are located within a fixed window
around p_1, say [0.98*p_1, 1.02*p_1]. If there are k peaks within this
window (2% on either side of p_1), they are denoted as
m_{1,1}, m_{2,1}, ..., m_{k,1}. For each of these peaks consider a
window extending 5% on each side of m_{j,1}, that is
[0.95*m_{j,1}, 1.05*m_{j,1}] for j = 1, ..., k. For each one of these
windows compute the correlation coefficient of the intensities in the
reference spectrum over this mass range with the intensities in the
test spectrum over the matching window centered about p_1,
[0.95*p_1, 1.05*p_1].
From these k correlation coefficients choose the corresponding
target peak as that with the highest correlation. This procedure may
then be repeated using different m/z
ranges. For each subrange one
obtains a mass corresponding to the highest peak in the test spectrum
and a mass corresponding to the peak in the reference spectrum that
shows the best correlation. If a set of peaks is thought to be
present in nearly all spectra, the ranges may be set to focus upon
them. Alternatively, a set of subranges that partitions the range of
interest may be chosen, e.g. 2,000 - 4,000, 4,000 - 7,000, 7,000 -
10,000, 10,000 - 13,000, and 13,000 - 20,000.
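The matching step described above can be sketched in a few lines. The following is an illustrative Python sketch only, not the R program referred to on this page. All function names are hypothetical, and two details are assumptions the text does not specify: peaks in the reference spectrum are taken to be simple local maxima, and the two windows are compared by linearly interpolating each onto the same number of evenly spaced points, since the windows around p_1 and a candidate target generally cover different numbers of samples.

```python
from bisect import bisect_left
from math import sqrt


def interp(mz, inten, x):
    """Linearly interpolate intensity at m/z x (mz sorted strictly ascending)."""
    i = bisect_left(mz, x)
    if i <= 0:
        return inten[0]
    if i >= len(mz):
        return inten[-1]
    t = (x - mz[i - 1]) / (mz[i] - mz[i - 1])
    return inten[i - 1] + t * (inten[i] - inten[i - 1])


def window_profile(mz, inten, center, half_width=0.05, n=101):
    """Intensities at n evenly spaced points over [(1-hw)*center, (1+hw)*center]."""
    lo = (1 - half_width) * center
    step = 2 * half_width * center / (n - 1)
    return [interp(mz, inten, lo + k * step) for k in range(n)]


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return sab / sqrt(saa * sbb)


def local_peaks(mz, inten, lo, hi):
    """Indices of local maxima with lo <= m/z <= hi (assumed peak definition)."""
    return [i for i in range(1, len(mz) - 1)
            if lo <= mz[i] <= hi and inten[i - 1] < inten[i] >= inten[i + 1]]


def match_peak(test, ref, lo, hi, search=0.02, half_width=0.05):
    """Return (p, m, r) for the subrange [lo, hi], or None if nothing matches.

    test and ref are (mz_list, intensity_list) pairs. p is the m/z of the
    most intense test point in the subrange, m is the reference peak within
    +/- search of p whose window correlates best with the window around p,
    and r is that correlation.
    """
    t_mz, t_int = test
    r_mz, r_int = ref
    in_range = [i for i in range(len(t_mz)) if lo <= t_mz[i] <= hi]
    if not in_range:
        return None
    p = t_mz[max(in_range, key=lambda i: t_int[i])]
    t_prof = window_profile(t_mz, t_int, p, half_width)
    best = None
    for j in local_peaks(r_mz, r_int, (1 - search) * p, (1 + search) * p):
        r = pearson(window_profile(r_mz, r_int, r_mz[j], half_width), t_prof)
        if best is None or r > best[2]:
            best = (p, r_mz[j], r)
    return best
```

Calling match_peak once per subrange (for example 2,000 - 4,000, then 4,000 - 7,000, and so on) yields one (p, m, r) triple per subrange. On real spectra the simple local-maximum rule would pick up noise, so a proper peak-detection step would be needed in its place.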
Here is R code that implements the algorithm (find-matches.r). To use
this program, run it in the same directory that holds the test and
reference samples. The output of the program looks like:
Reference | Misaligned | Correlation | PctDiff
  2926.49 |    2918.50 |        0.91 | 0.002738
  5871.29 |    5855.14 |        0.94 | 0.002759
  7727.91 |    7708.46 |        0.99 | 0.002524
 10218.87 |   10196.50 |        0.97 | 0.002194
 13989.55 |   13959.64 |        0.89 | 0.002142
The first two columns show the paired m/z values for the peaks --
Reference values come from the reference spectrum and Misaligned values
come from the test spectrum. Correlation shows the correlation over
the windows (5% on each side) centered on the two m/z values in the pair.
PctDiff shows the relative difference between the two m/z values, i.e.
(reference m/z - test m/z)/test m/z. Note that it is reported as a
fraction, so 0.002738 is about 0.27%. Low correlations deserve more
scrutiny, and the R code provides graphs when run interactively in a
line-by-line mode (these graphs are saved as a PostScript, .ps, file
when run in batch) -- in particular one may see the correlations of the
second and third best peaks and graphs of the relevant regions. What is
considered a low correlation may vary by project and samples (e.g. all
from the same pool to determine reproducibility vs. samples from
distinct biological entities), but I would investigate correlations
below .75 or .70 for samples from a common pool and below .65 or .60
for distinct biological entities. Also, the PctDiff column gives a
sense of whether a trend exists and whether a single point shows a
strong departure from the overall pattern. If the fourth row had a
PctDiff of -.005 and the other values remained the same, then I would
wonder whether the wrong peak was chosen and investigate further.
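The checks described above can be written down as a short screening step. This is an illustrative Python sketch, not part of find-matches.r. The function names are hypothetical, the rows are copied from the output table above, and the two thresholds (min_corr and max_dev) are assumed values that would need to be chosen per project. Comparing each PctDiff to the median is also only one assumed way of spotting a departure from the overall pattern.

```python
from statistics import median

# (reference m/z, test m/z, correlation), copied from the output table above
rows = [
    (2926.49, 2918.50, 0.91),
    (5871.29, 5855.14, 0.94),
    (7727.91, 7708.46, 0.99),
    (10218.87, 10196.50, 0.97),
    (13989.55, 13959.64, 0.89),
]


def pct_diff(reference_mz, test_mz):
    """Relative difference as defined in the text (a fraction, not a percent)."""
    return (reference_mz - test_mz) / test_mz


def flag_rows(rows, min_corr=0.70, max_dev=0.002):
    """Return (row_index, reasons) for rows that deserve a closer look.

    min_corr and max_dev are assumed thresholds: a row is flagged if its
    correlation is below min_corr, or if its PctDiff differs from the
    median PctDiff of all rows by more than max_dev.
    """
    diffs = [pct_diff(ref, test) for ref, test, _ in rows]
    mid = median(diffs)
    flagged = []
    for i, ((_, _, corr), d) in enumerate(zip(rows, diffs)):
        reasons = []
        if corr < min_corr:
            reasons.append("low correlation")
        if abs(d - mid) > max_dev:
            reasons.append("PctDiff departs from trend")
        if reasons:
            flagged.append((i, reasons))
    return flagged
```

With the table as given, no row is flagged under these assumed thresholds. Replacing the fourth row's test m/z so that its PctDiff becomes -.005 makes that row stand out, which corresponds to the case discussed in the text.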
Here is a link to a view of all the files in this directory.