On optimal pairwise linear classifiers for normal distributions: the d-dimensional case
Introduction
The problem of finding linear classifiers has been studied by many researchers in the field of pattern recognition (PR) [3], [17], [18], [19]. Linear classifiers are very important because of their simplicity of implementation and their classification speed. Various schemes for obtaining linear classifiers are reported in the literature, such as Fisher’s approach [3], [4], [5], the perceptron algorithm (the basis of the back-propagation neural network learning algorithms) [6], [7], [8], [9], piecewise recognition models [10], random search optimization [11], removal classification structures [12], adaptive linear dimensionality reduction [13] (which outperforms Fisher's classifier for some data sets), and linear constrained distance-based classifier analysis [14] (an improvement to Fisher's approach designed for hyperspectral image classification). All of these approaches lack optimality: although they do determine linear classification functions, the resulting classifiers are not optimal.
Apart from the results reported in [1], [15], in statistical PR the Bayes classifier for normally distributed classes is known to be linear in only a single case: the traditional one, in which the covariance matrices are equal [16], [17], [18]. In this case, the classifier is a single straight line (or a hyperplane in the d-dimensional case) completely specified by a first-order equation.
In [1], [15], we showed that although the general classifier for two-dimensional normally distributed random vectors is a second-degree polynomial, this polynomial degenerates into either a single straight line or a pair of straight lines. Thus, we found the necessary and sufficient conditions under which the classifier can be linear even when the covariance matrices are not equal. In this case, the classification function is a pair of first-order equations, which are the factors of the second-order polynomial. When the factors are equal, the classification function is a single straight line, which corresponds to the traditional case of equal covariance matrices.
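The degeneration described above can be made concrete with a small worked example (the coefficients below are illustrative and are not taken from the paper): a second-degree classification function whose quadratic part factors into two first-order terms.

```latex
% Illustrative example (coefficients assumed, not from the paper):
% a second-order classification function that factors into a pair
% of first-order equations.
\[
  g(x_1, x_2) \;=\; x_1^2 - x_2^2 + 2x_2 - 1
            \;=\; (x_1 + x_2 - 1)(x_1 - x_2 + 1) .
\]
% The boundary g = 0 is the pair of straight lines
% x_1 + x_2 = 1  and  x_2 = x_1 + 1; when the two linear factors
% coincide, the pair collapses to a single line, which is the
% equal-covariance case.
```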
Some examples of pairwise-linear classifiers for two- and three-dimensional normally distributed random vectors can be found in Ref. [3, pp. 42–43]. By studying these, the reader should observe that the existence of such classifiers was known. The novelty of our results is the set of conditions for pairwise-linear classifiers, and the demonstration that these, in their own right, lead to superior linear classifiers.
In this paper, we extend these conditions to d-dimensional normal random vectors, where d>2. We assume that the features of an object to be recognized are represented as a d-dimensional vector, an ordered tuple X=[x1, …, xd]T characterized by a probability distribution function. We deal only with the case in which these random vectors have a jointly normal distribution, where class ωi has mean vector Mi and covariance matrix Σi.
Without loss of generality, we assume that the classes ω1 and ω2 have the same a priori probability, 0.5, in which case the classifier is given by the log-likelihood ratio test, which assigns X to ω1 whenever p(X | ω1) ≥ p(X | ω2).
When Σ1=Σ2, the classification function is linear [3], [18], [19]. When Σ1 and Σ2 are arbitrary, the classifier is a general equation of second degree, so the decision boundary is a hyperparaboloid, a hyperellipsoid, a hypersphere, a hyperboloid, or a pair of hyperplanes. This latter case is the focus of our present study.
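The quadratic discriminant underlying this discussion can be sketched as follows. This is the standard log-likelihood ratio for two Gaussian classes with equal priors; the function name and the parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def bayes_discriminant(x, m1, s1, m2, s2):
    """Log-likelihood ratio g(x) for two Gaussian classes with equal
    priors; g(x) > 0 assigns x to class 1.  The function is quadratic
    in x in general, and linear precisely when s1 == s2 (the quadratic
    terms cancel)."""
    s1i, s2i = np.linalg.inv(s1), np.linalg.inv(s2)
    d1, d2 = x - m1, x - m2
    return (0.5 * (d2 @ s2i @ d2 - d1 @ s1i @ d1)
            + 0.5 * np.log(np.linalg.det(s2) / np.linalg.det(s1)))

# Equal covariances: the boundary g(x) = 0 is a single hyperplane,
# and a point equidistant from both means lies exactly on it.
m1, m2 = np.zeros(3), np.ones(3)
S = np.eye(3)
x = np.array([0.5, 0.5, 0.5])
print(bayes_discriminant(x, m1, S, m2, S))  # 0.0 — x lies on the hyperplane
```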
The results presented here have been rigorously tested. In particular, we present empirical results for the cases in which the optimal Bayes classifier is a pair of hyperplanes. It is worth mentioning that we tested the case of Minsky's paradox [20] on randomly generated samples, and found that the accuracy is very high even though the classes overlap significantly.
Section snippets
Linear classifiers for diagonalized classes: the 2-D case
The concept of diagonalization is quite fundamental to our study. Diagonalization is the process of transforming a space by performing linear and whitening transformations [19]. Consider a normally distributed random vector with an arbitrary mean vector and covariance matrix. By performing diagonalization, it can be transformed into another normally distributed random vector whose covariance is the identity matrix. This can easily be generalized to incorporate what is called “simultaneous diagonalization”.
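The whitening step can be sketched numerically (a minimal sketch, assuming the standard eigendecomposition-based whitening transform; the matrix values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary symmetric positive-definite covariance matrix.
A = rng.normal(size=(3, 3))
sigma = A @ A.T + 3 * np.eye(3)

# Whitening transform W = Lambda^(-1/2) Phi^T, where
# sigma = Phi Lambda Phi^T is the eigendecomposition of sigma.
lam, phi = np.linalg.eigh(sigma)
W = np.diag(lam ** -0.5) @ phi.T

# After the transformation, the covariance W sigma W^T is the
# identity matrix; applying the same W to a second class's
# covariance is the first step of simultaneous diagonalization.
print(np.round(W @ sigma @ W.T, 6))
```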
Multi-dimensional pairwise hyperplane classifiers
Let us consider now the more general case for d>2. Using the results mentioned above, we derive the necessary and sufficient conditions for a pairwise-linear optimal Bayes classifier. Starting from the inequality constraints (a) and (b) of Theorem 1, we state and prove that, when d>2, it is not possible for the optimal Bayes classifier to be a pair of hyperplanes under these conditions. We modify the notation marginally: we use the symbols (a1⁻¹, a2⁻¹, …, ad⁻¹) to refer synonymously to the marginal variances.
Linear classifiers with different means
In Ref. [1], we showed that, given two normally distributed random vectors with mean vectors and covariance matrices of the stated form, the optimal Bayes classifier is a pair of straight lines when r²=s², where a and b are any positive real numbers. The classifier for this case is given in [1].
We consider now the more general case for d>2. We are interested in finding the conditions that guarantee a pairwise-linear classifier.
Linear classifiers with equal means
We consider now a particular instance of the problem discussed in Section 4, which leads to the resolution of the generalization of the d-dimensional Minsky's paradox. In this case, the covariance matrices have the form of Eq. (10), but the mean vectors are the same for both classes. We show that, with these parameters, it is always possible to find a pair of hyperplanes, which resolves Minsky's paradox in the most general case (Theorem 7).
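The equal-means setting can be sketched in two dimensions (a minimal sketch; the covariance values are illustrative, chosen so the determinants match and the log term of the discriminant vanishes — this is the XOR-like setting of Minsky's paradox, not the paper's exact parameters):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two classes with identical (zero) means but swapped variances.
s1 = np.diag([1.0, 4.0])
s2 = np.diag([4.0, 1.0])

# With equal determinants, the Bayes boundary
# x^T (inv(s2) - inv(s1)) x = 0 reduces to 0.75*(x2^2 - x1^2) = 0,
# i.e. the pair of straight lines x2 = x1 and x2 = -x1.
def classify(x):
    q = x @ (np.linalg.inv(s2) - np.linalg.inv(s1)) @ x
    return 1 if q > 0 else 2

t1 = rng.multivariate_normal(np.zeros(2), s1, size=1000)
t2 = rng.multivariate_normal(np.zeros(2), s2, size=1000)
correct = (sum(classify(x) == 1 for x in t1)
           + sum(classify(x) == 2 for x in t2))
acc = correct / 2000
print(acc)  # well above chance despite heavy class overlap
```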
Simulation results for synthetic data
In order to test the accuracy of the pairwise linear classifiers and to verify the results derived here, we performed simulations for the different cases discussed above. We chose the dimension d=3, since it is easy to visualize and plot the corresponding hyperplanes. In all the simulations, we trained our classifier using 100 randomly generated training samples (three-dimensional vectors from the corresponding classes), with the class parameters obtained by maximum-likelihood estimation (MLE).
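The train-and-test loop can be sketched as follows. This is a generic sketch of the procedure just described, assuming standard MLE (sample mean and 1/n-normalized sample covariance); the true parameter values, sample sizes beyond the stated 100 training samples, and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative true parameters of two 3-D normal classes.
m1, s1 = np.zeros(3), np.diag([1.0, 2.0, 3.0])
m2, s2 = np.ones(3), np.diag([3.0, 2.0, 1.0])

def mle(samples):
    """MLE of a Gaussian: sample mean and 1/n sample covariance."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    return mean, centered.T @ centered / len(samples)

# Train from 100 samples per class.
(e_m1, e_s1) = mle(rng.multivariate_normal(m1, s1, size=100))
(e_m2, e_s2) = mle(rng.multivariate_normal(m2, s2, size=100))

def g(x):
    """Plug-in quadratic discriminant (up to a positive factor);
    g(x) > 0 assigns x to class 1."""
    d1, d2 = x - e_m1, x - e_m2
    return (d2 @ np.linalg.inv(e_s2) @ d2 - d1 @ np.linalg.inv(e_s1) @ d1
            + np.log(np.linalg.det(e_s2) / np.linalg.det(e_s1)))

# Score freshly generated test samples.
test1 = rng.multivariate_normal(m1, s1, size=500)
test2 = rng.multivariate_normal(m2, s2, size=500)
acc = (sum(g(x) > 0 for x in test1) + sum(g(x) <= 0 for x in test2)) / 1000
print(acc)
```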
Pairwise-linear classifiers on real-life data
Real-life data sets that satisfy the given constraints are not very common, and in general a classification scheme should be applicable to any data set, or at least to a particular domain. Having demonstrated how our results apply to synthetic data sets, we now propose a method that substitutes the actual parameters of the data sets with approximated parameters for which the required constraints are satisfied. These parameters, in turn, are obtained by solving a constrained optimization problem.
Conclusions
In this paper, we have extended the theoretical framework of obtaining optimal pairwise-linear classifiers for normally distributed classes to d-dimensional normally distributed random vectors, where d>2.
We have determined the necessary and sufficient conditions for an optimal pairwise-linear classifier when the covariance matrices are the identity and a diagonal matrix. In this case, we have formally shown that it is possible to find the optimal linear classifier by satisfying certain conditions on the class parameters.
Acknowledgements
The authors are very grateful to the anonymous referee for his/her suggestions on enhancing our theorems to also include the necessary and sufficient conditions for the respective results. By utilizing these suggestions, we were also able to extend the previous results obtained for two-dimensional features [1], thus improving the quality of both the previous paper and the present one. We are truly grateful. A preliminary version of this paper can be found in the Proceedings of AI’01.
About the Author—LUIS G. RUEDA received the degree of “Licenciado” in Computer Science from National University of San Juan, Argentina, in 1993. He obtained his Masters degree in Computer Science from Carleton University, Canada, in 1998. He has recently completed his Ph.D. in Computer Science at the School of Computer Science at Carleton University, Canada. His research interests include statistical pattern recognition, lossless data compression, cryptography, and database query optimization.
References (20)
- Evolution and generalization of a single neurone: I. Single-layer perceptron as seven statistical classifiers, Neural Networks (1998)
- Evolution and generalization of a single neurone: II. Complexity of statistical classifiers and sample size considerations, Neural Networks (1998)
- et al., Adaptive linear dimensionality reduction for classification, Pattern Recognition (2000)
- et al., A linear constrained distance-based discriminant analysis for hyperspectral image classification, Pattern Recognition (2001)
- L. Rueda, B.J. Oommen, On optimal pairwise linear classifiers for normal distributions: the two-dimensional case, IEEE...
- et al., Pattern Classification (2000)
- On an extended Fisher criterion for feature selection, IEEE Trans. Pattern Anal. Mach. Intell. (1981)
- Pattern Recognition: Statistical, Structural and Neural Approaches (1992)
- An Introduction to Computing with Neural Nets
- Nearest Neighbor Pattern Classification
- Perceptrons
Cited by (8)
- An efficient approach to compute the threshold for multi-dimensional linear classifiers, Pattern Recognition (2004)
- Selecting the best hyperplane in the framework of optimal pairwise linear classifiers, Pattern Recognition Letters (2004). Citation excerpt: “The only case in which the optimal classifier was known to be linear is when two normally distributed classes have identical covariance matrices (Duda et al., 2000; Webb, 2002). In (Rueda and Oommen, 2002, 2003), it has been shown that the optimal classifier can be of the form of a pair of hyperplanes even though the covariance matrices are different. As mentioned in Section 1, for two normally distributed classes the optimal classifier is a quadratic function that represents a hyperquadric in the d-dimensional space.”
- A new approach to multi-class linear dimensionality reduction, Lecture Notes in Computer Science (2006)
- A theoretical comparison of two linear dimensionality reduction techniques, Lecture Notes in Computer Science (2006)
- An empirical evaluation of the classification error of two thresholding methods for Fisher's classifier, Proceedings of the International Conference on Artificial Intelligence, IC-AI'04 (2004)
About the Author—B. JOHN OOMMEN obtained his B.Tech. degree from the Indian Institute of Technology, Madras, India in 1975. He obtained his M.E. from the Indian Institute of Science in Bangalore, India in 1977. He then went on for his M.S. and Ph.D., which he obtained from Purdue University in West Lafayette, Indiana in 1979 and 1982, respectively. He joined the School of Computer Science at Carleton University in Ottawa, Canada, in the 1981–1982 academic year. He is still at Carleton and holds the rank of Full Professor. His research interests include Automata Learning, Adaptive Data Structures, Statistical and Syntactic Pattern Recognition, Stochastic Algorithms and Partitioning Algorithms. He is the author of over 185 refereed journal and conference publications, and is a Senior Member of the IEEE. He is also on the Editorial Board of the IEEE Transactions on Systems, Man and Cybernetics, and of Pattern Recognition.
1. The work of this author was partially supported by Departamento de Informática, Universidad Nacional de San Juan, Argentina.
2. Partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.