Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins
Introduction
In Gram negative bacteria, after being synthesized by the translation apparatus, a protein can stay in the cytoplasm or be exported. For exported (but not secreted) proteins, four subcellular locations are possible: the inner (cytoplasmic) membrane, the periplasmic space, the cell wall and the outer membrane. In sequence databases, the subcellular location is documented for only some proteins. For Gram negative bacteria, this information is given for 5325 of 16 561 protein sequences (32%) in SWISS-PROT 38 [1]. A general and simple method able to predict the location when it is not known would therefore be useful.
Multivariate statistics are particularly well suited to the study of compositional data in proteins. For example, correspondence analysis (CA) was used to determine trends in amino acid usage in Escherichia coli [2]. Co-inertia analysis has been used to examine amino-acid physico-chemical properties and protein composition [3]. Discriminant analysis (DA) has served to determine protein secondary structural segments [4], to differentiate intracellular and extracellular proteins [5], and to detect membrane-spanning proteins [6].
Correspondence discriminant analysis (CDA) can be applied to frequency tables, whereas classical DA is limited to quantitative variables. CDA can therefore easily be used with sequence data such as codon or amino acid frequency tables. In a previous study, we used CDA to predict the subcellular location of E. coli proteins divided into three classes: cytoplasmic, periplasmic, and integral membrane proteins [7]. The good results obtained convinced us to extend the use of this method to all Gram negative bacteria.
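As an illustration of the kind of frequency table such a method takes as input, the following minimal sketch builds a table of amino acid counts with one row per protein. The function name `composition_table` and the two toy sequences are hypothetical and are not taken from the study.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes (columns of the table).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_table(sequences):
    """Return one row of amino acid counts per protein sequence.

    Hypothetical helper for illustration; residues outside the
    20 standard codes are simply ignored.
    """
    table = []
    for seq in sequences:
        counts = Counter(seq.upper())
        table.append([counts.get(aa, 0) for aa in AMINO_ACIDS])
    return table

# Made-up toy sequences, far shorter than real proteins.
toy_sequences = ["MKKLLA", "MFFLIV"]
table = composition_table(toy_sequences)
```

Each row of `table` is the amino acid profile of one protein, so the whole table can be read as a contingency table of proteins by amino acids.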
General presentation
CDA is a particular case of the duality diagram [8], [9]: a triplet (Z, M, N) consists of an n by p data table Z, a matrix M defining a Euclidean metric in the subject space E=ℝ^p, and a matrix N defining a Euclidean metric in the variable space F=ℝ^n. From this triplet we deduce, by matrix diagonalization, four families of vectors with several optimality properties (Fig. 1). We use this diagram in the following particular case: X=[xij] is a contingency table with q rows (proteins) and p columns (amino acids).
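The diagonalization of a triplet (Z, M, N) can be sketched numerically as follows. This is a minimal illustration that assumes diagonal metrics and uses the ordinary correspondence analysis triplet as an example; it does not include the discriminant step of CDA. The function name `diagonalize_triplet` and the toy table are hypothetical.

```python
import numpy as np

def diagonalize_triplet(Z, col_weights, row_weights):
    """Diagonalize the triplet (Z, M, N) with M = diag(col_weights)
    and N = diag(row_weights), through the SVD of N^(1/2) Z M^(1/2).

    Returns the eigenvalues and the row and column coordinates.
    Hypothetical helper; assumes strictly positive weights.
    """
    sqrt_n = np.sqrt(row_weights)[:, None]   # column vector, one entry per row of Z
    sqrt_m = np.sqrt(col_weights)[None, :]   # row vector, one entry per column of Z
    U, s, Vt = np.linalg.svd(sqrt_n * Z * sqrt_m, full_matrices=False)
    eigenvalues = s ** 2
    row_scores = (U * s) / sqrt_n            # coordinates of the rows
    col_scores = (Vt.T * s) / sqrt_m.T       # coordinates of the columns
    return eigenvalues, row_scores, col_scores

# Made-up contingency table (4 rows, 3 columns), for illustration only.
X = np.array([[10.0, 2.0, 1.0],
              [3.0, 8.0, 2.0],
              [1.0, 2.0, 9.0],
              [4.0, 4.0, 4.0]])

P = X / X.sum()                  # relative frequencies
r = P.sum(axis=1)                # row weights (diagonal of N)
c = P.sum(axis=0)                # column weights (diagonal of M)
Z = P / np.outer(r, c) - 1.0     # doubly centred table of the CA triplet

eigenvalues, row_scores, col_scores = diagonalize_triplet(Z, c, r)
```

In this example the eigenvalues sum to the chi-square statistic of the table divided by its grand total, which is the total inertia of a correspondence analysis.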
Data set
To establish our data set we used release 38 of SWISS-PROT, structured with the ACNUC sequence database management system [12]. The advantage of SWISS-PROT is that almost all exact redundancies have been removed, so there is no risk of introducing biases due to sequence duplications. From our data set we discarded hypothetical proteins, partial proteins, proteins with fewer than 50 amino acids, proteins without any indication of their subcellular location and
Results
The map obtained by crossing the two factors of the CDA performed on our analysis set showed that the first factor separates the integral membrane proteins from the cytoplasmic and periplasmic proteins, while the second factor separates the periplasmic proteins from the cytoplasmic and integral membrane proteins (Fig. 4). On the first factor, the mean of the scores obtained by integral membrane proteins was −1.060 (S.D.=0.754) and the mean of the scores obtained by the other proteins was 0.577
Discussion
The results obtained in discriminating proteins from Gram negative bacteria according to their subcellular location confirm and extend our previous results on E. coli [7]. The discrimination of integral membrane proteins by amino acids such as Phe, Leu and Ile is not surprising, as these amino acids are known to be hydrophobic. Likewise, the discrimination of cytoplasmic proteins by Arg, Glu and His can be easily explained, as these three amino acids are charged and hydrophilic and so are required
Acknowledgements
Thanks are due to Manolo Gouy for his helpful comments and careful reading of the manuscript and to Daniel Chessel for his help on CDA mathematical basis.
References (16)
- et al. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol. (1994)
- et al. The detection and classification of membrane-spanning proteins. Biochim. Biophys. Acta (1985)
- et al. NetMul, a World-Wide Web user interface for multivariate analysis software. Comput. Stat. Data Anal. (1996)
- et al. Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. (1991)
- et al. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 2000. Nucleic Acids Res. (2000)
- et al. Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Res. (1994)
- et al. Co-inertia analysis of amino-acid physico-chemical properties and protein composition with the ADE package. Comput. Appl. Biosci. (1995)
- A multivariate analysis method for discriminating protein secondary structural segments. Protein Eng. (1988)