Selecting peak masses and corresponding targets

Selection of Peak Masses (p_i) and Corresponding Targets (m_i)

Overview of approach
For each spectrum requiring realignment, both algorithms need a relatively small set of peaks with m/z values p₁,...,p_N and associated target m/z values m₁,...,m_N. Here is one approach for generating these corresponding lists in an automated fashion.

This approach begins with a reference spectrum -- a fixed spectrum to which all other spectra will be compared and aligned. It may be obtained by choosing one of the spectra or by creating a spectrum that is the average of all the spectra. An example of a reference spectrum is here. A spectrum that requires alignment to this reference will be referred to as a test spectrum (example).

To begin, define a subrange of the mass values, in this case perhaps 2,000 to 3,000 Daltons. Within this range, locate the largest peak in the test spectrum and denote its location as p₁. Then consider all the peaks in the reference spectrum that lie within a fixed window around p₁, say [0.98*p₁, 1.02*p₁], i.e. within 2% of p₁ on each side. If there are k peaks within this window, denote them m₁¹, m₂¹, ..., m_k¹.

For each candidate m_j¹, j = 1, ..., k, take the window extending 5% on each side, [0.95*m_j¹, 1.05*m_j¹], and compute the correlation coefficient between the reference intensities over this window and the test intensities over the corresponding 5% window centered at p₁, [0.95*p₁, 1.05*p₁]. Of these k candidates, the one with the highest correlation coefficient is chosen as the target peak for p₁.

This procedure may then be repeated using different m/z ranges. Each subrange yields one pair: the mass of the highest peak in the test spectrum and the mass of the reference peak that shows the best correlation with it. If a set of peaks is thought to be present in nearly all spectra, the ranges may be set to focus on them. Alternatively, a set of subranges that partitions the range of interest may be chosen, e.g. 2,000 - 4,000, 4,000 - 7,000, 7,000 - 10,000, 10,000 - 13,000, and 13,000 - 20,000.
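The matching step for a single subrange can be sketched in a few lines of Python. This is only an illustration of the procedure described above, not the find-matches.r program. It makes several assumptions of its own: a spectrum is a pair of lists (sorted m/z values and intensities), a "peak" in the reference is a simple local maximum, the "largest peak" in the test spectrum is the highest-intensity point in the subrange, and the two 5% windows are resampled onto 101 evenly spaced points by linear interpolation so that the correlation is taken over vectors of equal length. All function names are hypothetical.

```python
from bisect import bisect_left
from math import exp, sqrt


def interp(mz, inten, x):
    """Linearly interpolate the intensity at m/z value x (clamped at the ends)."""
    i = bisect_left(mz, x)
    if i <= 0:
        return inten[0]
    if i >= len(mz):
        return inten[-1]
    t = (x - mz[i - 1]) / (mz[i] - mz[i - 1])
    return inten[i - 1] + t * (inten[i] - inten[i - 1])


def window_profile(mz, inten, center, half_width=0.05, n=101):
    """Sample n evenly spaced intensities over [center*(1-hw), center*(1+hw)]."""
    lo = center * (1 - half_width)
    step = 2 * half_width * center / (n - 1)
    return [interp(mz, inten, lo + k * step) for k in range(n)]


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def local_maxima(mz, inten, lo, hi):
    """m/z values of simple local maxima with lo <= m/z <= hi (assumed peak definition)."""
    return [mz[i] for i in range(1, len(mz) - 1)
            if lo <= mz[i] <= hi and inten[i - 1] < inten[i] >= inten[i + 1]]


def match_peak(ref_mz, ref_int, test_mz, test_int, lo, hi,
               search=0.02, half_width=0.05):
    """Return (p, m, r) for the subrange [lo, hi], or None if nothing is found.

    p: m/z of the highest test-spectrum point in the subrange
    m: reference peak within +/- search of p with the best window correlation
    r: that correlation coefficient
    """
    idx = [i for i, m in enumerate(test_mz) if lo <= m <= hi]
    if not idx:
        return None
    p = test_mz[max(idx, key=lambda i: test_int[i])]
    candidates = local_maxima(ref_mz, ref_int, p * (1 - search), p * (1 + search))
    if not candidates:
        return None
    test_prof = window_profile(test_mz, test_int, p, half_width)
    scored = [(pearson(window_profile(ref_mz, ref_int, m, half_width), test_prof), m)
              for m in candidates]
    r, m = max(scored)
    return p, m, r


# Synthetic demonstration (made-up data): a main peak at 2500 and a smaller
# neighbour at 2540; the test spectrum is the same shape with its m/z axis
# scaled by 0.997.
def two_peaks(m):
    return 10 * exp(-((m - 2500) / 10) ** 2 / 2) + 3 * exp(-((m - 2540) / 10) ** 2 / 2)


grid = [float(m) for m in range(2000, 3201)]
ref_int = [two_peaks(m) for m in grid]
test_int = [two_peaks(m / 0.997) for m in grid]

match = match_peak(grid, ref_int, grid, test_int, 2000, 3000)
print(match)
```

Calling match_peak once per subrange (for example once for each interval in the partition listed above) would give the lists p₁,...,p_N and m₁,...,m_N. Real spectra are noisy, so the local-maximum rule here would likely need to be replaced by a proper peak-detection step.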

Here is R code that implements the algorithm (find-matches.r). To use it, run the program in the directory that holds the test and reference samples. Its output looks like:

Reference	Misaligned	Correlation	PctDiff
2926.49	2918.50	0.91	0.002738
5871.29	5855.14	0.94	0.002759
7727.91	7708.46	0.99	0.002524
10218.87	10196.50	0.97	0.002194
13989.55	13959.64	0.89	0.002142

The first two columns show the paired mass values for the peaks: Reference values come from the reference file and Misaligned values from the test spectrum. Correlation is the correlation over the 5% windows centered on the two m/z values in the pair. PctDiff is the relative difference between the two m/z values, i.e. (reference m/z - test m/z)/test m/z. It is reported as a fraction, so 0.002738 corresponds to about 0.27%.

Low correlations deserve more scrutiny. The R code provides graphs when run interactively in a line-by-line mode (these graphs are saved as a postscript, .ps, file when run in batch). In particular, one may see the correlations of the second and third best peaks and graphs of the relevant regions. What counts as a low correlation may vary by project and samples (e.g. all from the same pool to determine reproducibility vs. samples from distinct biological entities), but I would investigate values below .75 or .70 for samples from a common pool and below .65 or .60 for distinct biological entities.

The PctDiff column also gives a sense of whether a trend exists and whether a single point departs strongly from the overall pattern. If the fourth row had a PctDiff of -.005 and the other values remained the same, I would wonder whether the wrong peak had been chosen and investigate further.
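As a rough illustration of these checks, the following Python sketch recomputes PctDiff for the five pairs in the sample output and flags rows that might deserve a closer look. The correlation cutoff of .75 is one of the values suggested above for samples from a common pool. The trend check (distance from the median PctDiff, with a cutoff of 0.002) is a hypothetical rule chosen only for this example, as are the function names.

```python
from statistics import median

# Rows copied from the sample output: (reference m/z, test m/z, correlation)
pairs = [
    (2926.49, 2918.50, 0.91),
    (5871.29, 5855.14, 0.94),
    (7727.91, 7708.46, 0.99),
    (10218.87, 10196.50, 0.97),
    (13989.55, 13959.64, 0.89),
]


def pct_diff(reference_mz, test_mz):
    """Relative difference as defined in the text: (reference - test) / test."""
    return (reference_mz - test_mz) / test_mz


def flag_pairs(pairs, min_corr=0.75, max_dev=0.002):
    """Return one (reference, test, pct_diff, reasons) tuple per pair.

    min_corr: correlations below this are flagged (cutoff suggested in the text)
    max_dev:  assumed cutoff on |PctDiff - median PctDiff|, hypothetical
    """
    diffs = [pct_diff(ref, test) for ref, test, _ in pairs]
    centre = median(diffs)
    report = []
    for (ref, test, corr), d in zip(pairs, diffs):
        reasons = []
        if corr < min_corr:
            reasons.append("low correlation")
        if abs(d - centre) > max_dev:
            reasons.append("PctDiff departs from trend")
        report.append((ref, test, d, reasons))
    return report


for ref, test, d, reasons in flag_pairs(pairs):
    print(f"{ref:10.2f} {test:10.2f} {d:9.6f} {', '.join(reasons) or 'ok'}")
```

With the sample values none of the rows is flagged. Replacing the fourth test m/z by a value that gives a PctDiff of -.005 would trigger the trend flag for that row, which matches the situation described above where I would suspect the wrong peak had been chosen.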

Here is a link to view all the files in this directory.