Pattern Recognition

Volume 36, Issue 1, January 2003, Pages 13-23

On optimal pairwise linear classifiers for normal distributions: the d-dimensional case

https://doi.org/10.1016/S0031-3203(02)00053-5

Abstract

We consider the well-studied pattern recognition problem of designing linear classifiers. When dealing with normally distributed classes, it is well known that the optimal Bayes classifier is linear only when the covariance matrices are equal. This was the only known condition for classifier linearity. In a previous work, we presented the theoretical framework for optimal pairwise linear classifiers for two-dimensional normally distributed random vectors. We derived the necessary and sufficient conditions that the distributions have to satisfy so as to yield the optimal linear classifier as a pair of straight lines.

In this paper, we extend the previous work to d-dimensional normally distributed random vectors. We provide the necessary and sufficient conditions under which the optimal Bayes classifier is a pair of hyperplanes. Various scenarios are considered, including one that resolves the multi-dimensional Minsky's paradox for the perceptron. We also provide three-dimensional examples for all the cases and test the classification accuracy of the corresponding pairwise-linear classifier. In all the cases, these linear classifiers achieve very good performance. To demonstrate that the pairwise-linear philosophy yields superior discriminants on real-life data, we show how linear classifiers determined using a maximum-likelihood estimate (MLE) applicable to this approach yield better accuracy than the discriminants obtained by the traditional Fisher's classifier on a real-life data set. The multi-dimensional generalization of the MLE for these classifiers is currently being investigated.

Introduction

The problem of finding linear classifiers has been studied by many researchers in the field of pattern recognition (PR) [3], [17], [18], [19]. Linear classifiers are very important because of their simplicity of implementation and their classification speed. Various schemes that yield linear classifiers are reported in the literature, such as Fisher's approach [3], [4], [5], the perceptron algorithm (the basis of the back-propagation neural network learning algorithms) [6], [7], [8], [9], piecewise recognition models [10], random search optimization [11], removal classification structures [12], adaptive linear dimensionality reduction [13] (which outperforms Fisher's classifier for some data sets), and linear constrained distance-based classifier analysis [14] (an improvement to Fisher's approach designed for hyperspectral image classification). All of these approaches lack optimality: although they do determine linear classification functions, the resulting classifier is not, in general, the optimal one.

Apart from the results reported in [1], [15], in statistical PR, the Bayesian linear classification for normally distributed classes involves a single case. This traditional case is when the covariance matrices are equal [16], [17], [18]. In this case, the classifier is a single straight line (or a hyperplane in the d-dimensional case) completely specified by a first-order equation.

In [1], [15], we showed that although the general classifier for two-dimensional normally distributed random vectors is a second-degree polynomial, this polynomial degenerates to be either a single straight line or a pair of straight lines. Thus, we have found the necessary and sufficient conditions under which the classifier can be linear even when the covariance matrices are not equal. In this case, the classification function is a pair of first-order equations, which are factors of the second-order polynomial (i.e. the classification function). When the factors are equal, the classification function is given by a single straight line, which corresponds to the traditional case when the covariance matrices are equal.

Some examples of pairwise-linear classifiers for two- and three-dimensional normally distributed random vectors can be found in Ref. [3, pp. 42–43]. By studying these, the reader should observe that the existence of such classifiers was known. The novelty of our results lies in the conditions for pairwise-linear classifiers, and in the demonstration that these, in their own right, lead to superior linear classifiers.

In this paper, we extend these conditions to d-dimensional normal random vectors, where d>2. We assume that the features of an object to be recognized are represented as a d-dimensional vector, an ordered tuple X = [x1, x2, …, xd]^T, characterized by a probability distribution function. We deal only with the case in which these random vectors have a jointly normal distribution, where class ωi has mean Mi and covariance matrix Σi, i = 1, 2.

Without loss of generality, we assume that the classes ω1 and ω2 have the same a priori probability, 0.5, in which case the classifier is given by

\log\frac{|\Sigma_2|}{|\Sigma_1|} - (X - M_1)^T \Sigma_1^{-1} (X - M_1) + (X - M_2)^T \Sigma_2^{-1} (X - M_2) = 0.
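As a minimal illustration of this decision rule (a sketch of ours, not code from the paper; the parameter values below are arbitrary), the discriminant can be evaluated directly, assigning a sample to ω1 when it is positive and to ω2 otherwise:

```python
import numpy as np

def bayes_discriminant(x, M1, S1, M2, S2):
    """Quadratic Bayes discriminant for two normal classes with equal priors.
    Returns g(x); g > 0 favours class omega_1, g < 0 favours omega_2."""
    d1, d2 = x - M1, x - M2
    return (np.log(np.linalg.det(S2) / np.linalg.det(S1))
            - d1 @ np.linalg.inv(S1) @ d1
            + d2 @ np.linalg.inv(S2) @ d2)

# Illustrative three-dimensional parameters with unequal covariance matrices.
M1, M2 = np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])
S1, S2 = np.diag([1.0, 2.0, 0.5]), np.diag([2.0, 1.0, 0.5])
x = np.array([0.3, -0.2, 0.1])
label = 1 if bayes_discriminant(x, M1, S1, M2, S2) > 0 else 2
```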

When Σ1=Σ2, the classification function is linear [3], [18], [19]. When Σ1 and Σ2 are arbitrary, the classification function is a general equation of second degree, so the classifier is a hyperparaboloid, a hyperellipsoid, a hypersphere, a hyperboloid, or a pair of hyperplanes. This latter case is the focus of our present study.

The results presented here have been rigorously tested. In particular, we present some empirical results for the cases in which the optimal Bayes classifier is a pair of hyperplanes. It is worth mentioning that we tested the case of Minsky's paradox [20] on randomly generated samples, and we have found that the accuracy is very high even though the classes are significantly overlapping.

Section snippets

Linear classifiers for diagonalized classes: the 2-D case

The concept of diagonalization is quite fundamental to our study. Diagonalization is the process of transforming a space by performing linear and whitening transformations [19]. Consider a normally distributed random vector, X, with any mean vector and covariance matrix. By performing diagonalization, X can be transformed into another normally distributed random vector, Z, whose covariance is the identity matrix. This can be easily generalized to incorporate what is called “simultaneous
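Since the snippet above appeals to (simultaneous) diagonalization, the following NumPy sketch shows the standard construction of a transformation A that whitens Σ1 and simultaneously diagonalizes Σ2; it is our illustration of the textbook technique [19], not the paper's code, and the matrices used are arbitrary:

```python
import numpy as np

def simultaneous_diagonalization(S1, S2):
    """Return A such that A @ S1 @ A.T = I and A @ S2 @ A.T is diagonal."""
    lam1, Phi1 = np.linalg.eigh(S1)           # S1 = Phi1 diag(lam1) Phi1^T
    W = np.diag(lam1 ** -0.5) @ Phi1.T        # whitening transform w.r.t. S1
    _, Phi2 = np.linalg.eigh(W @ S2 @ W.T)    # rotation diagonalizing the transformed S2
    return Phi2.T @ W

# Sanity check on two arbitrary symmetric positive-definite matrices.
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.3], [-0.3, 3.0]])
A = simultaneous_diagonalization(S1, S2)
assert np.allclose(A @ S1 @ A.T, np.eye(2), atol=1e-10)
```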

Multi-dimensional pairwise hyperplane classifiers

Let us consider now the more general case for d>2. Using the results mentioned above, we derive the necessary and sufficient conditions for a pairwise-linear optimal Bayes classifier. From the inequality constraints (a) and (b) of Theorem 1, we state and prove that it is not possible to find the optimal Bayes classifier as a pair of hyperplanes for these conditions when d>2. We modify the notation marginally. We use the symbols (a1^{-1}, a2^{-1}, …, ad^{-1}) to synonymously refer to the marginal variances (

Linear classifiers with different means

In Ref. [1], we have shown that given two normally distributed random vectors, X1 and X2, with mean vectors and covariance matrices of the form

M_1 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad
M_2 = \begin{bmatrix} -r \\ -s \end{bmatrix}, \quad
\Sigma_1 = \begin{bmatrix} a^{-1} & 0 \\ 0 & b^{-1} \end{bmatrix}
\quad \text{and} \quad
\Sigma_2 = \begin{bmatrix} b^{-1} & 0 \\ 0 & a^{-1} \end{bmatrix},

the optimal Bayes classifier is a pair of straight lines when r^2 = s^2, where a and b are any positive real numbers. The classifier for this case is given by

a(x - r)^2 + b(y - s)^2 - b(x + r)^2 - a(y + s)^2 = 0.
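As a quick sanity check (our own algebra, not quoted from [1]), take the root s = r of the condition r^2 = s^2; the classification function then factors into two linear terms:

\begin{align*}
a(x-r)^2 + b(y-r)^2 - b(x+r)^2 - a(y+r)^2
  &= a\,[(x-r)^2 - (y+r)^2] - b\,[(x+r)^2 - (y-r)^2] \\
  &= a\,(x-y-2r)(x+y) - b\,(x-y+2r)(x+y) \\
  &= (x+y)\,[(a-b)(x-y) - 2r(a+b)],
\end{align*}

so the decision boundary is the pair of straight lines x + y = 0 and (a - b)(x - y) = 2r(a + b). The case s = -r is analogous, with the roles of x + y and x - y interchanged.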

We consider now the more general case for d>2. We are interested in finding the conditions that guarantee a pairwise-linear

Linear classifiers with equal means

We consider now a particular instance of the problem discussed in Section 4, which leads to the resolution of the generalization of the d-dimensional Minsky's paradox. In this case, the covariance matrices have the form of Eq. (10), but the mean vectors are the same for both classes. We shall now show that, with these parameters, it is always possible to find a pair of hyperplanes, which resolves Minsky's paradox in the most general case.

Theorem 7

Let X1 ∼ N(M1, Σ1) and X2 ∼ N(M2, Σ2) be two normal random
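To make the equal-means setting concrete, the sketch below (with illustrative parameters of our own choosing, not those of the paper) draws samples from two zero-mean classes whose diagonal covariances differ by a swap of the first two entries; the determinants are then equal, the quadratic discriminant degenerates to the pair of hyperplanes x2 = x1 and x2 = -x1, and its empirical accuracy is measured:

```python
import numpy as np

rng = np.random.default_rng(0)

# Equal means; diagonal covariances differing by a swap of the first two entries
# (a 3-D, XOR-like configuration).  a, b, c are illustrative values only.
a, b, c = 4.0, 1.0, 2.0
M = np.zeros(3)
S1 = np.diag([1 / a, 1 / b, 1 / c])
S2 = np.diag([1 / b, 1 / a, 1 / c])

X1 = rng.multivariate_normal(M, S1, 10_000)   # samples from class omega_1
X2 = rng.multivariate_normal(M, S2, 10_000)   # samples from class omega_2

def g(X):
    """Quadratic Bayes discriminant for equal means and equal priors.
    Since |S1| = |S2|, the log-determinant term vanishes and the decision
    boundary reduces to (a - b)(x2**2 - x1**2) = 0, i.e. the planes x2 = +/- x1."""
    A = np.linalg.inv(S2) - np.linalg.inv(S1)
    return np.einsum('ij,jk,ik->i', X, A, X)

accuracy = 0.5 * ((g(X1) > 0).mean() + (g(X2) <= 0).mean())
```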

Simulation results for synthetic data

In order to test the accuracy of the pairwise linear classifiers and to verify the results derived here, we have performed some simulations for the different cases discussed above. We have chosen the dimension d=3, since it is easy to visualize and plot the corresponding hyperplanes. In all the simulations, we trained our classifier using 100 randomly generated training samples (which were three-dimensional vectors from the corresponding classes). Using the maximum-likelihood estimation (MLE)
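The training step mentioned above relies on the standard maximum-likelihood estimates of each class's mean vector and covariance matrix; a minimal sketch (ours, with placeholder data rather than the paper's samples) is:

```python
import numpy as np

def mle_normal(samples):
    """Maximum-likelihood estimates of the mean and covariance of a normal class.
    Note that the ML covariance estimate divides by n, not n - 1."""
    mu = samples.mean(axis=0)
    centered = samples - mu
    sigma = centered.T @ centered / samples.shape[0]
    return mu, sigma

# e.g. 100 three-dimensional training samples for one class, as in the simulations.
rng = np.random.default_rng(1)
train = rng.multivariate_normal(np.zeros(3), np.eye(3), 100)
M_hat, Sigma_hat = mle_normal(train)
```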

Pairwise-linear classifiers on real-life data

Real-life data sets that satisfy the constraints derived in the preceding sections are not very common, and, in general, a classification scheme should be applicable to any data set, or at least to a particular domain. Having demonstrated how our results apply to synthetic data sets, we now propose a method that substitutes the actual parameters of the data sets with approximated parameters for which the required constraints are satisfied. These parameters, in turn, are obtained by solving a constrained
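Although the snippet breaks off mid-sentence, the idea of replacing the estimated parameters by nearby ones that satisfy the linearity constraints can be sketched generically as a constrained projection. The least-squares objective, the flat parameter vector, and the placeholder linearity_constraint below are assumptions of ours, not the paper's actual formulation:

```python
import numpy as np
from scipy.optimize import minimize

def nearest_admissible(theta_mle, linearity_constraint):
    """Project the ML parameter estimates onto the set where the (hypothetical)
    pairwise-linearity constraint holds, staying as close to the MLE as possible."""
    objective = lambda theta: np.sum((theta - theta_mle) ** 2)
    result = minimize(objective, x0=theta_mle,
                      constraints=[{'type': 'eq', 'fun': linearity_constraint}])
    return result.x

# Toy usage: force the first two (hypothetical) variance parameters to coincide.
theta_hat = nearest_admissible(np.array([2.0, 1.5, 0.7]),
                               lambda t: t[0] - t[1])
```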

Conclusions

In this paper, we have extended the theoretical framework of obtaining optimal pairwise-linear classifiers for normally distributed classes to d-dimensional normally distributed random vectors, where d>2.

We have determined the necessary and sufficient conditions for an optimal pairwise-linear classifier when the covariance matrices are the identity and a diagonal matrix. In this case, we have formally shown that it is possible to find the optimal linear classifier by satisfying certain

Acknowledgements

The authors are very grateful to the anonymous referee for his/her suggestions on enhancing our theorems to also include the necessary and sufficient conditions for the respective results. By utilizing these suggestions, we were also able to extend the previous results obtained for the two-dimensional features [1], thus improving the quality of both the previous paper, and of this present one. We are truly grateful. A preliminary version of this paper can be found in the Proceedings of AI01,


Cited by (8)

  • Selecting the best hyperplane in the framework of optimal pairwise linear classifiers

    2004, Pattern Recognition Letters
    Citation excerpt:

    The only case that the optimal classifier was known to be linear is when two normally distributed classes have identical covariance matrices (Duda et al., 2000; Webb, 2002). In (Rueda and Oommen, 2002, 2003), it has been shown that the optimal classifier can be of the form of a pair of hyperplanes even though the covariance matrices are different. As mentioned in Section 1, for two normally distributed classes the optimal classifier is a quadratic function that represents a hyperquadric in the d-dimensional space.

  • A new approach to multi-class linear dimensionality reduction

    2006, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
  • A theoretical comparison of two linear dimensionality reduction techniques

    2006, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
  • An empirical evaluation of the classification error of two thresholding methods for Fisher's classifier

    2004, Proceedings of the International Conference on Artificial Intelligence, IC-AI'04

About the Author—LUIS G. RUEDA received the degree of “Licenciado” in Computer Science from National University of San Juan, Argentina, in 1993. He obtained his Masters degree in Computer Science from Carleton University, Canada, in 1998. He has recently completed his Ph.D. in Computer Science at the School of Computer Science at Carleton University, Canada. His research interests include statistical pattern recognition, lossless data compression, cryptography, and database query optimization.

About the Author—B. JOHN OOMMEN obtained his B.Tech. degree from the Indian Institute of Technology, Madras, India in 1975. He obtained his M.E. from the Indian Institute of Science in Bangalore, India in 1977. He then went on for his M.S. and Ph.D., which he obtained from Purdue University, in West Lafayette, Indiana in 1979 and 1982, respectively. He joined the School of Computer Science at Carleton University in Ottawa, Canada, in the 1981–1982 academic year. He is still at Carleton and holds the rank of a Full Professor. His research interests include Automata Learning, Adaptive Data Structures, Statistical and Syntactic Pattern Recognition, Stochastic Algorithms and Partitioning Algorithms. He is the author of over 185 refereed journal and conference publications, and is a Senior Member of the IEEE. He is also on the Editorial board for the IEEE Transactions on Systems, Man and Cybernetics, and for Pattern Recognition.

1. The work of this author was partially supported by Departamento de Informática, Universidad Nacional de San Juan, Argentina.

2. Partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.
