Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins

https://doi.org/10.1016/S0169-2607(02)00011-1

Abstract

Correspondence discriminant analysis (CDA) is a multivariate statistical method, derived from discriminant analysis, that can be applied to contingency tables. We have used CDA to separate proteins from Gram negative bacteria according to their subcellular location. The high resolution of the discrimination obtained makes this method a good tool for predicting subcellular location when this information is not known. The main advantage of this technique is its simplicity: by computing two linear formulae on amino acid composition, it is possible to classify a protein into one of the three classes of subcellular location we have defined. The CDA itself can be computed with the ADE-4 software package, which can be downloaded, together with the data set used in this study, from the Pôle Bio-Informatique Lyonnais (PBIL) server at http://pbil.univ-lyon1.fr.

Introduction

In Gram negative bacteria, after being synthesized by the translation apparatus, a protein can stay in the cytoplasm or be exported. In the case of exported (but not secreted) proteins, four possible subcellular locations exist: the inner (cytoplasmic) membrane, the periplasmic space, the cell wall and the outer membrane. In sequence databases, information on the subcellular location is available for some proteins only. In the case of Gram negative bacteria, this information is given for 5325 protein sequences among 16 561 (32%) in SWISS-PROT 38 [1]. It is therefore useful to have a general and simple method able to predict the location when this information is not known.

Multivariate statistics are particularly well suited to the study of compositional data in proteins. For example, correspondence analysis (CA) was employed to determine trends in amino acid usage in Escherichia coli [2]. Co-inertia analysis has been used to examine amino acid physico-chemical properties and protein composition [3]. Discriminant analysis (DA) has served to determine protein secondary structure segments [4], to differentiate intracellular and extracellular proteins [5], and to detect membrane-spanning proteins [6].

Correspondence discriminant analysis (CDA) is a method that can be used on frequency tables, whereas classical DA is limited to quantitative variables. CDA can therefore easily be employed with sequence data such as codon or amino acid frequency tables. In a previous study, we used CDA to predict the subcellular location of E. coli proteins divided into three classes: cytoplasmic, periplasmic, and integral membrane proteins [7]. The good results obtained convinced us to extend the use of this method to all Gram negative bacteria.
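
As the abstract notes, classification rests on two linear formulae computed on amino acid composition. The following is a minimal Python sketch of that computation only, under stated assumptions: the coefficient vectors DEMO_F1 and DEMO_F2 and the intercept are arbitrary placeholders invented for illustration, not the coefficients fitted in the study, and the function names are hypothetical. The rule that maps the two scores to one of the three classes is not reproduced here, because it depends on the fitted analysis.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes


def composition(sequence):
    """Return the relative frequency of each standard amino acid in a sequence."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acid")
    return [c / total for c in counts]


def linear_score(freqs, coefficients, intercept=0.0):
    """Evaluate one linear formula: intercept + sum of coefficient * frequency."""
    return intercept + sum(w * f for w, f in zip(coefficients, freqs))


def cda_scores(sequence, f1, f2, intercept1=0.0, intercept2=0.0):
    """Return the pair of scores given by two linear formulae for one protein."""
    freqs = composition(sequence)
    return (linear_score(freqs, f1, intercept1),
            linear_score(freqs, f2, intercept2))


# ASSUMPTION: placeholder coefficients, chosen arbitrarily so the arithmetic is
# easy to follow. They are NOT the discriminant coefficients from the paper.
DEMO_F1 = [1.0 if aa == "A" else 0.0 for aa in AMINO_ACIDS]
DEMO_F2 = [1.0 if aa == "C" else 0.0 for aa in AMINO_ACIDS]

if __name__ == "__main__":
    # "AAC" is a made-up toy sequence, not a real protein.
    print(cda_scores("AAC", DEMO_F1, DEMO_F2, intercept2=-0.5))
```

With real coefficients, each protein would be placed as a point on the two-factor map, and its class read from its position relative to the three groups.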

General presentation

CDA is a particular case of the duality diagram [8], [9]: a triplet (Z, M, N) consists of an n by p data table Z, a matrix M defining a Euclidean metric in the subject space E = R^p, and a matrix N defining a Euclidean metric in the variable space F = R^n. Matrix diagonalization of this triplet yields four families of vectors with several optimality properties (Fig. 1). We use this diagram in the following particular case: X = [x_ij] is a contingency table with q rows (proteins) and p columns (amino acids).
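
The contingency table X described above can be sketched in a few lines of Python. This is an illustration only: the two sequences are made-up toy strings, the function names are hypothetical, and the margins are computed because correspondence analysis conventionally derives its row and column weights from them. How the paper defines M and N exactly is not shown in this excerpt.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # p = 20 columns, one per standard amino acid


def contingency_table(sequences):
    """Build X = [x_ij]: x_ij is the count of amino acid j in protein i."""
    return [[seq.upper().count(aa) for aa in AMINO_ACIDS] for seq in sequences]


def margins(table):
    """Return the row totals, column totals and grand total of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)


def row_profiles(table):
    """Divide each row by its total, giving one composition profile per protein."""
    return [[x / sum(row) for x in row] for row in table]


if __name__ == "__main__":
    # ASSUMPTION: toy sequences for illustration, not entries from SWISS-PROT.
    toy_sequences = ["MKLA", "AAGG"]
    X = contingency_table(toy_sequences)
    rows, cols, grand = margins(X)
    print(rows, grand)
    print(cols[AMINO_ACIDS.index("A")])
```

Each row of X is one protein and each column one amino acid, so q equals the number of sequences and p equals 20 in this sketch.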

Data set

To establish our data set we used release 38 of SWISS-PROT structured with the ACNUC sequence database management system [12]. The advantage of SWISS-PROT is that almost all exact redundancies have been removed, so that there is no risk of introducing biases due to sequence duplications. In our data set we have discarded hypothetical proteins, partial proteins, proteins with fewer than 50 amino acids, proteins without any indication of their subcellular location and

Results

The map obtained by crossing the two factors of the CDA performed on our analysis set showed that the first factor separates the integral membrane proteins from the cytoplasmic and periplasmic proteins, while the second factor separates the periplasmic proteins from the cytoplasmic and integral membrane proteins (Fig. 4). On the first factor, the mean of the scores obtained by integral membrane proteins was −1.060 (S.D.=0.754) and the mean of the scores obtained by the other proteins was 0.577

Discussion

The results obtained in the discrimination of proteins from Gram negative bacteria according to their subcellular location confirm and extend our previous results on E. coli [7]. The discrimination of integral membrane proteins by amino acids like Phe, Leu and Ile is not surprising, as these amino acids are known to be hydrophobic. Also, discrimination of the cytoplasmic proteins by Arg, Glu and His can be easily explained as these three amino acids are charged and hydrophilic and so are required

Acknowledgements

Thanks are due to Manolo Gouy for his helpful comments and careful reading of the manuscript and to Daniel Chessel for his help with the mathematical basis of CDA.
