[Lecture Notes in Computer Science] Advances in Mass Data Analysis of Images and Signals in...

OplAnalyzer: A Toolbox for MALDI-TOF Mass Spectrometry Data Analysis

Thang V. Pham and Connie R. Jimenez

OncoProteomics Laboratory, Cancer Center Amsterdam, VU University Medical Center

De Boelelaan 1117, 1081 HV Amsterdam, The Netherlands

{t.pham,c.jimenez}@vumc.nl

http://www.oncoproteomics.nl/

Abstract. We present a software package for the analysis of MALDI-TOF mass spectrometry data. The software is designed to facilitate a complete exploratory workflow: pre-processing of raw spectral data, specification of study groups for comparison, statistical differential analysis, visualization of peptide peaks, and classification. The software supports various external tools for these tasks. We also pay special attention to the iterative nature of a typical analysis. Finally, we present two proteomics studies where the software has been used for data analysis.

Keywords: data analysis, differential analysis, bio-marker discovery, MALDI-TOF, mass spectrometry, OplAnalyzer, proteomics.

1 Introduction

Mass spectrometry is an attractive method in proteomics research because of its ability to identify and quantify a large number of proteins in complex biological samples [1]. However, the pre-processing and analysis of mass spectrometry data are fast becoming a bottleneck in the discovery process. This paper describes a software platform developed in our laboratory called OplAnalyzer, which supports pre-processing and analysis of proteomics mass spectrometry data. Specifically, we deal with MALDI-TOF mass spectrometry, a standard high-throughput platform that can potentially be used for various diagnostic purposes.

There are a number of tasks involved in a typical analysis: pre-processing of raw spectral data, specification of study groups for comparison, statistical differential analysis, visualization of peptide peaks, and classification [2]. Instead of integrating all these components into a single tool for a complete analysis, we develop a flexible platform where various existing tools for different tasks are accommodated. Our design also supports the interactive nature of the analysis process.

Currently, the software supports the analysis of MALDI-TOF MS-1 data only. Tools for the analysis of MS/MS data with protein identification, as well as data from another mass spectrometry platform, namely LC-FTMS, are under active development.

P. Perner and O. Salvetti (Eds.): MDA 2008, LNAI 5108, pp. 73–81, 2008. © Springer-Verlag Berlin Heidelberg 2008


[Figure: workflow diagram. (a) Data pre-processing; (b) Sample grouping; (c) Exploratory analysis, comprising differential analysis, visualization, and classification; (d) Batch processing.]

Fig. 1. An analysis workflow

The analysis workflow and the system are described in Section 2. In Section 3 we present two proteomics studies where the software has been employed for data analysis.

2 The System

Fig. 1 shows a typical workflow in proteomics mass spectrometry data analysis. The four main steps are: data pre-processing, sample grouping, exploratory analysis, and batch processing.

2.1 Data Pre-processing

The data pre-processing step includes the preparation of metadata and the processing of raw mass spectrometry signals, which consists of peak detection, alignment, normalization, and deisotoping. To facilitate the use of existing tools, we define a common data format between this step and the subsequent steps, which is simply based on tab-separated text.

For our instrument, a 4800 MALDI-TOF/TOF mass spectrometer (Applied Biosystems, Foster City, USA), we found that the MarkerView software (Applied Biosystems) works well for data produced in the reflectron mode.

For data produced in the linear mode we have implemented a new method. To detect peaks in an individual spectrum, we search for locations of maximal value within a local m/z window. The size of the window is 11 discrete sampling points. This method is similar to the peak detection method employed in [4].
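The windowed search can be illustrated with a minimal Python sketch. This is not the OplAnalyzer implementation: the function name `detect_peaks`, the centring of the window on the candidate point, the skipping of points near the edges, and the tie-breaking rule are all assumptions made for illustration.

```python
def detect_peaks(intensity, window=11):
    """Return indices whose value is the maximum of a centred window.

    Assumptions (not from the paper): the window is centred on the
    candidate point, points without a full window are skipped, and on
    ties only the first occurrence of the maximum counts as a peak.
    """
    half = window // 2
    peaks = []
    for i in range(half, len(intensity) - half):
        segment = intensity[i - half:i + half + 1]
        # The candidate sits at position `half` inside the segment.
        if segment.index(max(segment)) == half:
            peaks.append(i)
    return peaks


# Toy spectrum: two narrow peaks on a flat baseline.
spectrum = [0.0] * 25
spectrum[5:8] = [3.0, 7.0, 3.0]
spectrum[17:20] = [5.0, 9.0, 5.0]
print(detect_peaks(spectrum))
```

Because the first occurrence of the maximum must fall on the centre of the window, flat baseline regions never produce a peak in this sketch.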


[Figure: schematic along the m/z axis showing the mean spectrum with a common peak pM, an individual spectrum with its closest peak pI at distance d from pM, and two marked points A and B on the individual spectrum.]

Fig. 2. Peak alignment. For each common peak pM in the mean spectrum, the closest peak pI in each individual spectrum is located. If the distance d between the two peaks is less than √5, the value at point A is registered for the common peak pM in this particular spectrum. Otherwise, the value at B is registered.

To find peaks that are common to all spectra, we apply peak detection to the mean spectrum, analogously to [5]. Subsequently, peaks in an individual spectrum are aligned to this set of common peaks as follows. For each common peak, its value in an individual spectrum is that of the closest detected peak in that spectrum if the distance between the common peak and the closest peak (along the m/z axis) is less than √5 Da. (A better choice is likely to be based on the actual mass accuracy of the measurement and on the m/z value.) If there is no such peak, the value is simply set to the value of the spectrum at the m/z location of the common peak. Figure 2 illustrates the procedure. By visual inspection, we found that the quality of our alignment method is comparable to that of the more computationally expensive clustering method in [4] (data not shown).
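The alignment rule can be sketched in a few lines of Python. The sketch assumes that all spectra share one m/z grid, so that peaks can be referred to by index; the function name `align_spectrum` and its arguments are hypothetical and not part of OplAnalyzer.

```python
import math


def align_spectrum(mz, values, peak_idx, common_idx, tol=math.sqrt(5)):
    """Register one intensity per common peak for a single spectrum.

    mz, values : shared m/z grid and this spectrum's intensities
    peak_idx   : indices of peaks detected in this spectrum
    common_idx : indices of peaks detected in the mean spectrum
    tol        : maximum m/z distance (Da) for accepting a detected peak
    """
    aligned = []
    for c in common_idx:
        # Closest detected peak along the m/z axis (None if there are no peaks).
        nearest = min(peak_idx, key=lambda p: abs(mz[p] - mz[c]), default=None)
        if nearest is not None and abs(mz[nearest] - mz[c]) < tol:
            aligned.append(values[nearest])  # value of the detected peak
        else:
            aligned.append(values[c])  # fall back to the spectrum value at the common m/z
    return aligned


# Toy example on a 1 Da grid.
mz = [1000.0 + i for i in range(10)]
values = [1.0, 2.0, 8.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 6.0]
print(align_spectrum(mz, values, peak_idx=[2, 9], common_idx=[3, 5]))
```

In the toy example, the common peak at index 3 lies 1 Da from a detected peak and takes that peak's value, while the common peak at index 5 has no detected peak within the tolerance and takes the raw spectrum value.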

2.2 Sample Grouping

Typically, researchers are interested in several comparisons in each experiment, for example, comparisons based on gender, age, and clinical outcomes. Also, in an interactive analysis the user might want to modify the sample groups, for instance to include or exclude certain samples. To enable efficient sample grouping, we define a text-based sample selection based on metadata. The strategy is easy to use and particularly suited for batch processing. For example, to specify two groups, “Healthy” consisting of samples from healthy individuals and “Cancer” consisting of samples from cancer patients before treatment, the selection is as follows.

Healthy:Cancer-type=Healthy;Cancer:Cancer-type=NSCLC,Time=PreTx
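A parser for such a selection string might look like the following Python sketch. The grammar is inferred from the single example above (groups separated by `;`, the group name before `:`, and `key=value` conditions separated by `,` and combined with a logical AND); the function names and the sample metadata are hypothetical.

```python
def parse_selection(text):
    """Parse 'Name:key=value,key=value;Name:...' into {name: {key: value}}."""
    groups = {}
    for part in text.split(";"):
        name, _, conditions = part.partition(":")
        criteria = {}
        for condition in conditions.split(","):
            key, _, value = condition.partition("=")
            criteria[key.strip()] = value.strip()
        groups[name.strip()] = criteria
    return groups


def select_samples(metadata, criteria):
    """Return ids of samples whose metadata match every criterion."""
    return [
        sample_id
        for sample_id, fields in metadata.items()
        if all(fields.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical metadata table, keyed by sample id.
metadata = {
    "S1": {"Cancer-type": "Healthy", "Time": "NA"},
    "S2": {"Cancer-type": "NSCLC", "Time": "PreTx"},
    "S3": {"Cancer-type": "NSCLC", "Time": "EndTx"},
}
selection = "Healthy:Cancer-type=Healthy;Cancer:Cancer-type=NSCLC,Time=PreTx"
for name, criteria in parse_selection(selection).items():
    print(name, select_samples(metadata, criteria))
```

Keeping the selection as plain text means that a batch script can loop over many such strings without any interactive step.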


Fig. 3. A screenshot of the output of the statistical testing module

2.3 Exploratory Analysis

For data analysis we exploit existing tools in Matlab (The MathWorks, Inc). A typical first step is unsupervised analysis with principal component analysis (PCA) using all peptide intensities. Here all data points are projected onto a two- or three-dimensional space for visualization. The projection does not use any information about group labels. The purpose is two-fold. First, one can observe whether the data are clustered in a low-dimensional space according to group labels. Second, one can detect possible outliers or unusual patterns in the data by visual inspection.

For differential analysis, we provide interfaces for the t-test, the Mann-Whitney U test, and the Kruskal-Wallis test. The p-values can be adjusted for multiple testing. The peptides are further subjected to intensity filtering, requiring that the median intensity of at least one group must be greater than 80 units and the fold change of the median intensities of the two groups must be greater than 1.5. (The numbers can be tuned for each study.) Fig. 3 depicts a screenshot of the result of a comparative study.

The candidate peaks are examined visually by spectra overlay. Again, we use the visualization capability of Matlab for this purpose.
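The intensity filter can be written down compactly. The Python sketch below is an illustration only; it assumes that the fold change is taken as the larger median divided by the smaller one, so that either direction of change passes, and the function name is hypothetical.

```python
from statistics import median


def passes_intensity_filter(group_a, group_b, min_intensity=80.0, min_fold=1.5):
    """Check the two filtering criteria for one peptide peak.

    1. The median intensity of at least one group exceeds `min_intensity`.
    2. The ratio of the larger to the smaller group median exceeds `min_fold`.
    """
    low, high = sorted((median(group_a), median(group_b)))
    if high <= min_intensity:
        return False
    if low <= 0:
        # Assumption: a non-positive smaller median counts as an unbounded fold change.
        return True
    return high / low > min_fold


print(passes_intensity_filter([100, 120, 90], [40, 50, 60]))    # medians 100 vs 50
print(passes_intensity_filter([70, 60, 75], [30, 20, 25]))      # medians 70 vs 25
print(passes_intensity_filter([100, 110, 120], [90, 95, 100]))  # medians 110 vs 95
```

The first peak passes both criteria, the second fails the intensity threshold, and the third fails the fold-change threshold.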

Finally, we provide classification model selection with support vector machines [3]. A grid search method is used to find the optimal parameter values. For each value in the grid, the generalization error is estimated either by leave-one-out cross validation or by repeatedly splitting the data randomly into two partitions, one for training and one for testing. The grid point with the lowest estimated generalization error is selected as our model for classification.
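The selection loop itself is independent of the classifier. The Python sketch below shows grid search with leave-one-out cross validation; to stay self-contained it uses a k-nearest-neighbour classifier as a stand-in for the support vector machine, so the tuned parameter is the number of neighbours rather than SVM parameters. All names and the toy data are hypothetical.

```python
def knn_predict(train_x, train_y, x, k):
    """Stand-in classifier (not an SVM): majority vote among the k nearest points."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    votes = [train_y[i] for i in order[:k]]
    # Sorting the labels first makes tie-breaking deterministic.
    return max(sorted(set(votes)), key=votes.count)


def loocv_error(xs, ys, k):
    """Leave-one-out estimate of the error rate for one grid point."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) != ys[i]:
            errors += 1
    return errors / len(xs)


def grid_search(xs, ys, grid):
    """Return (lowest estimated error, parameter); ties go to the smaller parameter."""
    return min((loocv_error(xs, ys, k), k) for k in grid)


# Toy data: two well-separated one-dimensional classes.
xs = [[0.0], [0.1], [0.2], [0.3], [1.0], [1.1], [1.2], [1.3]]
ys = ["a"] * 4 + ["b"] * 4
print(grid_search(xs, ys, grid=[1, 3, 5, 7]))
```

With an SVM, the grid would instead range over its own parameters, but the structure of the loop stays the same.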


2.4 Batch Processing

We consider batch processing an important step in data analysis, especially with regard to reproducibility of figures and other results. In addition, batch processing helps produce a large number of figures of peptide peaks in a convenient format for visual examination. Again, we make use of the scripting capability of Matlab for this purpose.

3 Examples

In the following, we describe two studies where the current software has been employed for data analysis.

3.1 Time-Course MALDI-TOF-MS Serum Peptide Profiling of Non-small Cell Lung Cancer Patients Treated with Bortezomib, Cisplatin and Gemcitabine

This study performs serum peptide profiling of non-small cell lung cancer (NSCLC) patients treated with gemcitabine, cisplatin and bortezomib combinations before, during, and at the end of treatment to discover peptide patterns associated with treatment-related effects and clinical outcomes [7].

Fig. 4 shows a three-dimensional PCA plot of serum peptide spectra of 13 healthy individuals and the pre-treatment serum spectra of 27 NSCLC patients.

Fig. 4. Principal component analysis (PCA) of healthy versus NSCLC comparison


Fig. 5. (a) Spectra overlay of the eight most differential peaks in the healthy (red) versus NSCLC (blue) comparison according to p-values of the Mann-Whitney U test. All peaks have a p-value less than 0.0001. (b) Heatmap of the 47 differential peaks in the healthy versus NSCLC comparison shown in the natural log scale. The peaks are ordered by median fold change between the two groups.

Here, the MarkerView software was used for preprocessing, resulting in 682 peptide peaks per raw spectrum.

The Mann-Whitney U test is carried out on each of the 682 peptides, resulting in 47 differential peptides. Fig. 5(a) shows the spectra overlay of the eight most differential peaks in the healthy versus NSCLC comparison. Fig. 5(b) shows a heatmap of the 47 differential peaks.

We carried out classification analysis using a support vector machine. A grid search for parameters was employed to find the best model according to leave-one-out cross validation (LOOCV). Using all 682 peptides, a LOOCV accuracy of 93% was achieved. When the 47 peptides selected by the Mann-Whitney U test were used, the LOOCV accuracy was 98% with 100% sensitivity and 96% specificity.

The software has also been used for a large number of other comparisons such as gender, age, short and long progression-free survival, and clinical treatment responses.
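For readers less familiar with these measures, the Python sketch below shows how accuracy, sensitivity, and specificity follow from a confusion matrix. The paper reports neither the confusion matrix nor which class was treated as positive; the counts used here are purely illustrative and were chosen only because they are one combination that is consistent with 13 + 27 samples and the reported percentages, assuming the classification used those same 40 spectra.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity, and specificity from confusion counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # fraction of positives recovered
        "specificity": tn / (tn + fp),  # fraction of negatives recovered
    }


# Illustrative counts only (an assumption, not taken from the paper):
# 13 samples in the positive class, all correct; 27 in the negative class, one error.
metrics = classification_metrics(tp=13, fn=0, tn=26, fp=1)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

Under these assumed counts, a single misclassified sample out of 40 gives an accuracy of 39/40 and a specificity of 26/27.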


[Figure: plot of intensity (transformed value, y-axis from 0.6 to 0.8) against m/z (x-axis from 4000 to 5000).]

Fig. 6. Mean spectrum and detected peaks in the 4000-5000 Da range

3.2 Breast Cancer Study with MALDI-TOF Mass Spectrometry Data of Serum Samples

This study is part of the international competition on mass spectrometry proteomic diagnosis [8][9]. The dataset consists of 153 mass spectra of blood samples drawn from control individuals and patients with breast cancer. The aim is to construct a classification rule separating the two groups with a low generalization error.

For this dataset, the baseline correction had been performed by the competition organizer. We used the software to perform further pre-processing: peak detection and alignment. Fig. 6 shows an example of the result of the pre-processing algorithm.

Again, a Mann-Whitney U test was performed to select features discriminating the two classes significantly. Furthermore, the Benjamini-Hochberg false discovery rate correction [6] was employed to correct for multiple testing. This results in, on average, 117 peaks with a false discovery rate less than 1%. Fig. 7 shows the distribution of the values of the 16 most discriminative peaks.

We employed grid search with exponential spacing to find the optimal values for support vector machine model selection. The generalization error is estimated by averaging over 200 runs of randomly splitting the given data into two partitions, where the size of the test set is roughly a tenth of the size of the whole dataset. The feature selection was performed for each random splitting procedure, so that fair estimates of classification accuracy were obtained. The final accuracy on a separate validation set of 78 samples is 83%.
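The Benjamini-Hochberg adjustment [6] is short enough to sketch directly. The Python function below returns adjusted p-values in the original order; the function name and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are non-decreasing in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four peaks.
p = [0.01, 0.04, 0.03, 0.005]
q = benjamini_hochberg(p)
selected = [i for i, value in enumerate(q) if value < 0.05]
print(q, selected)
```

Peaks whose adjusted value falls below the chosen threshold are retained; the 0.05 threshold in the example is illustrative, whereas the study above uses 1%.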
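Two ingredients of this procedure, an exponentially spaced grid and repeated random splits, are sketched below in Python. The grid bounds, the base of 2, and the fixed random seed are assumptions for illustration; the paper does not state the values used.

```python
import random


def exponential_grid(start_exp, stop_exp, base=2.0):
    """Return base**e for every integer exponent from start_exp to stop_exp inclusive."""
    return [base ** e for e in range(start_exp, stop_exp + 1)]


def random_splits(n_samples, n_runs=200, test_fraction=0.1, seed=0):
    """Yield (train_indices, test_indices) for repeated random partitions."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_runs):
        shuffled = list(range(n_samples))
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


print(exponential_grid(-2, 2))
train, test = next(random_splits(153))
print(len(train), len(test))
```

To keep the error estimate fair, any feature selection (for example the Mann-Whitney U test followed by false discovery rate correction) would be run on the training indices of each split only, never on the held-out indices.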


[Figure: sixteen panels, one per peak, each with x-axis ticks at 50, 100, and 150. Panel titles: m/z = 1029.3742, 1030.6579, 1028.9464, 1102.0537, 1021.2623, 980.7667, 1074.3443, 1076.0933, 1022.1147, 1076.5307, 1056.0677, 1017.0058, 1059.9709, 1022.541, 977.0122, 991.2335.]

Fig. 7. Top 16 differential peaks


4 Summary

The paper has introduced a software toolbox for the pre-processing and statistical analysis of MALDI-TOF mass spectrometry data. Our current development focuses on support for the analysis of MS/MS data with protein identification and of data from another mass spectrometry platform, namely LC-FTMS.

References

1. Jimenez, C.R., Piersma, S., Pham, T.V.: High-throughput and targeted in-depth mass spectrometry-based approaches for biofluid profiling and biomarker discovery. Biomarkers in Medicine 1(4), 541–565 (2007)

2. Villanueva, J., Martorella, A.J., Lawlor, K., Philip, J., Fleisher, M., Robbins, R.J., Tempst, P.: Serum peptidome patterns that distinguish metastatic thyroid carcinoma from cancer-free controls are unbiased by gender and age. Mol. Cell Proteomics 5, 1840–1852 (2006)

3. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1999)

4. Tibshirani, R., Hastie, T., Narasimhan, B., Soltys, S., Shi, G., Koong, A., Le, Q.-T.: Sample classification from protein mass spectroscopy, by “peak probability contrasts”. Bioinformatics 20(17), 3034–3044 (2004)

5. Karpievitch, Y.V., Hill, E.G., Smolka, A.J., Morris, J.S., Coombes, K.R., Baggerly, K.A., Almeida, J.S.: PrepMS: TOF MS data graphical preprocessing tool. Bioinformatics 23(2), 264–265 (2007)

6. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B 57, 289–300 (1995)

7. Voortman, J., Pham, T.V., Knol, J.C., Giaccone, G., Jimenez, C.R.: Time-course MALDI-TOF-MS serum peptide profiling of non-small cell lung cancer patients treated with bortezomib, cisplatin and gemcitabine. In: Proceedings of American Society of Clinical Oncology (ASCO) 2008 Annual Meeting, Chicago, USA (2008)

8. Mertens, B.: International competition on mass spectrometry proteomic diagnosis. Statistical Applications in Genetics and Molecular Biology 7(2), Article 1 (2008)

9. Pham, T.V., van de Wiel, M.A., Jimenez, C.R.: Support vector machine approach to separate control and breast cancer serum samples. Statistical Applications in Genetics and Molecular Biology 7(2), Article 11 (January 2008)
