Pattern Recognition

Volume 36, Issue 1, January 2003, Pages 13-23

On optimal pairwise linear classifiers for normal distributions: the d-dimensional case

https://doi.org/10.1016/S0031-3203(02)00053-5

Abstract

We consider the well-studied pattern recognition problem of designing linear classifiers. When dealing with normally distributed classes, it is well known that the optimal Bayes classifier is linear only when the covariance matrices are equal. This was the only known condition for classifier linearity. In a previous work, we presented the theoretical framework for optimal pairwise linear classifiers for two-dimensional normally distributed random vectors. We derived the necessary and sufficient conditions that the distributions have to satisfy so as to yield the optimal linear classifier as a pair of straight lines.

In this paper, we extend the previous work to d-dimensional normally distributed random vectors. We provide the necessary and sufficient conditions under which the optimal Bayes classifier is a pair of hyperplanes. Various scenarios are considered, including one that resolves the multi-dimensional Minsky's paradox for the perceptron. We also provide three-dimensional examples for all the cases and test the classification accuracy of the corresponding pairwise-linear classifier. In all the cases, these linear classifiers achieve very good performance. To demonstrate that the pairwise-linear philosophy yields superior discriminants on real-life data, we show how linear classifiers determined using a maximum-likelihood estimate (MLE) applicable to this approach yield better accuracy than the discriminants obtained by the traditional Fisher's classifier on a real-life data set. The multi-dimensional generalization of the MLE for these classifiers is currently being investigated.

Introduction

The problem of finding linear classifiers has been studied by many researchers in the field of pattern recognition (PR) [3], [17], [18], [19]. Linear classifiers are very important because of their simplicity of implementation and their classification speed. Various schemes that yield linear classifiers are reported in the literature, such as Fisher's approach [3], [4], [5], the perceptron algorithm (the basis of the back-propagation neural network learning algorithms) [6], [7], [8], [9], piecewise recognition models [10], random search optimization [11], removal classification structures [12], adaptive linear dimensionality reduction [13] (which outperforms Fisher's classifier for some data sets), and linear constrained distance-based classifier analysis [14] (an improvement to Fisher's approach designed for hyperspectral image classification). All of these approaches lack optimality: although they do determine linear classification functions, the resulting classifier is not, in general, the optimal one.

Apart from the results reported in [1], [15], in statistical PR, the Bayesian linear classification for normally distributed classes involves a single case. This traditional case is when the covariance matrices are equal [16], [17], [18]. In this case, the classifier is a single straight line (or a hyperplane in the d-dimensional case) completely specified by a first-order equation.

In [1], [15], we showed that although the general classifier for two-dimensional normally distributed random vectors is a second-degree polynomial, this polynomial degenerates to be either a single straight line or a pair of straight lines. Thus, we have found the necessary and sufficient conditions under which the classifier can be linear even when the covariance matrices are not equal. In this case, the classification function is a pair of first-order equations, which are factors of the second-order polynomial (i.e. the classification function). When the factors are equal, the classification function is given by a single straight line, which corresponds to the traditional case when the covariance matrices are equal.

Some examples of pairwise-linear classifiers for two- and three-dimensional normally distributed random vectors can be found in Ref. [3, pp. 42–43]. By studying these, the reader should observe that the existence of such classifiers was known. The novelty of our results lies in the conditions for pairwise-linear classifiers, and in the demonstration that these, in their own right, lead to superior linear classifiers.

In this paper, we extend these conditions to d-dimensional normal random vectors, where d>2. We assume that the features of an object to be recognized are represented as a d-dimensional vector, an ordered tuple X = [x1, x2, …, xd]^T, characterized by a probability distribution function. We deal only with the case in which these random vectors have a jointly normal distribution, where class ωi has mean Mi and covariance matrix Σi, i = 1, 2.

Without loss of generality, we assume that the classes ω1 and ω2 have the same a priori probability, 0.5, in which case the classifier is given by

\log\frac{|\Sigma_2|}{|\Sigma_1|} - (X - M_1)^T \Sigma_1^{-1} (X - M_1) + (X - M_2)^T \Sigma_2^{-1} (X - M_2) = 0.
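As a minimal illustration of this decision rule (a sketch of ours, not code from the paper; the parameter values below are arbitrary), the discriminant can be evaluated directly, assigning a sample to ω1 when it is positive and to ω2 otherwise:

```python
import numpy as np

def bayes_discriminant(x, M1, S1, M2, S2):
    """Quadratic Bayes discriminant for two normal classes with equal priors.
    Returns g(x); g > 0 favours class omega_1, g < 0 favours omega_2."""
    d1, d2 = x - M1, x - M2
    return (np.log(np.linalg.det(S2) / np.linalg.det(S1))
            - d1 @ np.linalg.inv(S1) @ d1
            + d2 @ np.linalg.inv(S2) @ d2)

# Illustrative three-dimensional parameters with unequal covariance matrices.
M1, M2 = np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])
S1, S2 = np.diag([1.0, 2.0, 0.5]), np.diag([2.0, 1.0, 0.5])
x = np.array([0.3, -0.2, 0.1])
label = 1 if bayes_discriminant(x, M1, S1, M2, S2) > 0 else 2
```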

When Σ1=Σ2, the classification function is linear [3], [18], [19]. When Σ1 and Σ2 are arbitrary, the classification function is a general equation of second degree, so the classifier is a hyperparaboloid, a hyperellipsoid, a hypersphere, a hyperboloid, or a pair of hyperplanes. This latter case is the focus of our present study.

The results presented here have been rigorously tested. In particular, we present some empirical results for the cases in which the optimal Bayes classifier is a pair of hyperplanes. It is worth mentioning that we tested the case of Minsky's paradox [20] on randomly generated samples, and we have found that the accuracy is very high even though the classes are significantly overlapping.

Section snippets

Linear classifiers for diagonalized classes: the 2-D case

The concept of diagonalization is quite fundamental to our study. Diagonalization is the process of transforming a space by performing linear and whitening transformations [19]. Consider a normally distributed random vector, X, with any mean vector and covariance matrix. By performing diagonalization, X can be transformed into another normally distributed random vector, Z, whose covariance is the identity matrix. This can be easily generalized to incorporate what is called “simultaneous
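Since the snippet above appeals to (simultaneous) diagonalization, the following NumPy sketch shows the standard construction of a transformation A that whitens Σ1 and simultaneously diagonalizes Σ2; it is our illustration of the textbook technique [19], not the paper's code, and the matrices used are arbitrary:

```python
import numpy as np

def simultaneous_diagonalization(S1, S2):
    """Return A such that A @ S1 @ A.T = I and A @ S2 @ A.T is diagonal."""
    lam1, Phi1 = np.linalg.eigh(S1)           # S1 = Phi1 diag(lam1) Phi1^T
    W = np.diag(lam1 ** -0.5) @ Phi1.T        # whitening transform w.r.t. S1
    _, Phi2 = np.linalg.eigh(W @ S2 @ W.T)    # rotation diagonalizing the transformed S2
    return Phi2.T @ W

# Sanity check on two arbitrary symmetric positive-definite matrices.
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])
A = simultaneous_diagonalization(S1, S2)
assert np.allclose(A @ S1 @ A.T, np.eye(2), atol=1e-10)
```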

Multi-dimensional pairwise hyperplane classifiers

Let us consider now the more general case for d>2. Using the results mentioned above, we derive the necessary and sufficient conditions for a pairwise-linear optimal Bayes classifier. From the inequality constraints (a) and (b) of Theorem 1, we state and prove that it is not possible to find the optimal Bayes classifier as a pair of hyperplanes for these conditions when d>2. We modify the notation marginally. We use the symbols (a1^{-1}, a2^{-1}, …, ad^{-1}) to synonymously refer to the marginal variances (

Linear classifiers with different means

In Ref. [1], we have shown that given two normally distributed random vectors, X1 and X2, with mean vectors and covariance matrices of the form

M_1 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad
M_2 = \begin{bmatrix} -r \\ -s \end{bmatrix}, \quad
\Sigma_1 = \begin{bmatrix} a^{-1} & 0 \\ 0 & b^{-1} \end{bmatrix}
\quad \text{and} \quad
\Sigma_2 = \begin{bmatrix} b^{-1} & 0 \\ 0 & a^{-1} \end{bmatrix},

the optimal Bayes classifier is a pair of straight lines when r^2 = s^2, where a and b are any positive real numbers. The classifier for this case is given by

a(x - r)^2 + b(y - s)^2 - b(x + r)^2 - a(y + s)^2 = 0.
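As a quick sanity check (our own algebra, not quoted from [1]), take the root s = r of the condition r^2 = s^2; the classification function then factors into two linear terms:

\begin{align*}
a(x-r)^2 + b(y-r)^2 - b(x+r)^2 - a(y+r)^2
  &= a\,[(x-r)^2 - (y+r)^2] - b\,[(x+r)^2 - (y-r)^2] \\
  &= a\,(x-y-2r)(x+y) - b\,(x-y+2r)(x+y) \\
  &= (x+y)\,[(a-b)(x-y) - 2r(a+b)],
\end{align*}

so the decision boundary is the pair of straight lines x + y = 0 and (a - b)(x - y) = 2r(a + b). The case s = -r is analogous, with the roles of x + y and x - y interchanged.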

We consider now the more general case for d>2. We are interested in finding the conditions that guarantee a pairwise-linear

Linear classifiers with equal means

We consider now a particular instance of the problem discussed in Section 4, which leads to the resolution of the generalization of the d-dimensional Minsky's paradox. In this case, the covariance matrices have the form of Eq. (10), but the mean vectors are the same for both classes. We shall now show that, with these parameters, it is always possible to find a pair of hyperplanes, which resolves Minsky's paradox in the most general case.

Theorem 7

Let X1 ∼ N(M1, Σ1) and X2 ∼ N(M2, Σ2) be two normal random
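To make the equal-means setting concrete, the sketch below (with illustrative parameters of our own choosing, not those of the paper) draws samples from two zero-mean classes whose diagonal covariances differ by a swap of the first two entries; the determinants are then equal, the quadratic discriminant degenerates to the pair of hyperplanes x2 = x1 and x2 = -x1, and its empirical accuracy is measured:

```python
import numpy as np

rng = np.random.default_rng(0)

# Equal means; diagonal covariances differing by a swap of the first two entries
# (a 3-D, XOR-like configuration).  a, b, c are illustrative values only.
a, b, c = 4.0, 1.0, 2.0
M = np.zeros(3)
S1 = np.diag([1 / a, 1 / b, 1 / c])
S2 = np.diag([1 / b, 1 / a, 1 / c])

X1 = rng.multivariate_normal(M, S1, 10_000)   # samples from class omega_1
X2 = rng.multivariate_normal(M, S2, 10_000)   # samples from class omega_2

def g(X):
    """Quadratic Bayes discriminant for equal means and equal priors.
    Since |S1| = |S2|, the log-determinant term vanishes and the decision
    boundary reduces to (a - b)(x2**2 - x1**2) = 0, i.e. the planes x2 = +/- x1."""
    A = np.linalg.inv(S2) - np.linalg.inv(S1)
    return np.einsum('ij,jk,ik->i', X, A, X)

accuracy = 0.5 * ((g(X1) > 0).mean() + (g(X2) <= 0).mean())
```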

Simulation results for synthetic data

In order to test the accuracy of the pairwise linear classifiers and to verify the results derived here, we have performed some simulations for the different cases discussed above. We have chosen the dimension d=3, since it is easy to visualize and plot the corresponding hyperplanes. In all the simulations, we trained our classifier using 100 randomly generated training samples (which were three-dimensional vectors from the corresponding classes). Using the maximum-likelihood estimation (MLE)
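The training step mentioned above relies on the standard maximum-likelihood estimates of each class's mean vector and covariance matrix; a minimal sketch (ours, with placeholder data rather than the paper's samples) is:

```python
import numpy as np

def mle_normal(samples):
    """Maximum-likelihood estimates of the mean and covariance of a normal class.
    Note that the ML covariance estimate divides by n, not n - 1."""
    mu = samples.mean(axis=0)
    centered = samples - mu
    sigma = centered.T @ centered / samples.shape[0]
    return mu, sigma

# e.g. 100 three-dimensional training samples for one class, as in the simulations.
rng = np.random.default_rng(1)
train = rng.multivariate_normal(np.zeros(3), np.eye(3), 100)
M_hat, Sigma_hat = mle_normal(train)
```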

Pairwise-linear classifiers on real-life data

Real-life data sets that satisfy the constraints derived in the preceding sections are not very common, and, in general, a classification scheme should be applicable to any data set, or at least to a particular domain. Having demonstrated how our results apply to synthetic data sets, we now propose a method that substitutes the actual parameters of the data sets with approximated parameters for which the required constraints are satisfied. These parameters, in turn, are obtained by solving a constrained
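Although the snippet breaks off mid-sentence, the idea of replacing the estimated parameters by nearby ones that satisfy the linearity constraints can be sketched generically as a constrained projection. The least-squares objective, the flat parameter vector, and the placeholder linearity_constraint below are assumptions of ours, not the paper's actual formulation:

```python
import numpy as np
from scipy.optimize import minimize

def nearest_admissible(theta_mle, linearity_constraint):
    """Project the ML parameter estimates onto the set where the (hypothetical)
    pairwise-linearity constraint holds, staying as close to the MLE as possible."""
    objective = lambda theta: np.sum((theta - theta_mle) ** 2)
    result = minimize(objective, x0=theta_mle,
                      constraints=[{'type': 'eq', 'fun': linearity_constraint}])
    return result.x

# Toy usage: force the first two (hypothetical) variance parameters to coincide.
theta_hat = nearest_admissible(np.array([2.0, 1.5, 0.7]),
                               lambda t: t[0] - t[1])
```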

Conclusions

In this paper, we have extended the theoretical framework of obtaining optimal pairwise-linear classifiers for normally distributed classes to d-dimensional normally distributed random vectors, where d>2.

We have determined the necessary and sufficient conditions for an optimal pairwise-linear classifier when the covariance matrices are the identity and a diagonal matrix. In this case, we have formally shown that it is possible to find the optimal linear classifier by satisfying certain

Acknowledgements

The authors are very grateful to the anonymous referee for his/her suggestions on enhancing our theorems to also include the necessary and sufficient conditions for the respective results. By utilizing these suggestions, we were also able to extend the previous results obtained for the two-dimensional features [1], thus improving the quality of both the previous paper, and of this present one. We are truly grateful. A preliminary version of this paper can be found in the Proceedings of AI01,


Cited by (8)

  • Selecting the best hyperplane in the framework of optimal pairwise linear classifiers

    2004, Pattern Recognition Letters
    Citation excerpt:

    The only case that the optimal classifier was known to be linear is when two normally distributed classes have identical covariance matrices (Duda et al., 2000; Webb, 2002). In (Rueda and Oommen, 2002, 2003), it has been shown that the optimal classifier can be of the form of a pair of hyperplanes even though the covariance matrices are different. As mentioned in Section 1, for two normally distributed classes the optimal classifier is a quadratic function that represents a hyperquadric in the d-dimensional space.

  • A new approach to multi-class linear dimensionality reduction

    2006, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
  • A theoretical comparison of two linear dimensionality reduction techniques

    2006, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
  • An empirical evaluation of the classification error of two thresholding methods for Fisher's classifier

    2004, Proceedings of the International Conference on Artificial Intelligence, IC-AI'04

About the Author—LUIS G. RUEDA received the degree of “Licenciado” in Computer Science from National University of San Juan, Argentina, in 1993. He obtained his Masters degree in Computer Science from Carleton University, Canada, in 1998. He has recently completed his Ph.D. in Computer Science at the School of Computer Science at Carleton University, Canada. His research interests include statistical pattern recognition, lossless data compression, cryptography, and database query optimization.

About the Author—B. JOHN OOMMEN obtained his B.Tech. degree from the Indian Institute of Technology, Madras, India in 1975. He obtained his M.E. from the Indian Institute of Science in Bangalore, India in 1977. He then went on for his M.S. and Ph.D., which he obtained from Purdue University, in West Lafayette, Indiana in 1979 and 1982, respectively. He joined the School of Computer Science at Carleton University in Ottawa, Canada, in the 1981–1982 academic year. He is still at Carleton and holds the rank of a Full Professor. His research interests include Automata Learning, Adaptive Data Structures, Statistical and Syntactic Pattern Recognition, Stochastic Algorithms and Partitioning Algorithms. He is the author of over 185 refereed journal and conference publications, and is a Senior Member of the IEEE. He is also on the Editorial board for the IEEE Transactions on Systems, Man and Cybernetics, and for Pattern Recognition.

1. The work of this author was partially supported by Departamento de Informática, Universidad Nacional de San Juan, Argentina.

2. Partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.
