Elsevier

Neurocomputing

Volume 71, Issues 16–18, October 2008, Pages 3070-3076

Fuzzy classification using information theoretic learning vector quantization

https://doi.org/10.1016/j.neucom.2008.04.048

Abstract

In this article we extend the recently published unsupervised information theoretic vector quantization approach based on the Cauchy–Schwarz divergence for matching data and prototype densities to supervised learning and classification. In particular, we first generalize the unsupervised method to more general metrics than the Euclidean metric used in the original algorithm. Thereafter, we extend the model to a supervised learning method, resulting in a fuzzy classification algorithm. Thereby, we allow fuzzy labels for both data and prototypes. Finally, we transfer the idea of relevance learning for metric adaptation, known from learning vector quantization, to the new approach. We show the abilities and the power of the method for exemplary and real-world medical applications.

Introduction

Prototype based unsupervised vector quantization is an important task in pattern recognition. One basic advantage is the easy mapping scheme and the intuitive understanding offered by the concept of representative prototypes. Several prototype based methods have been established, ranging from statistical approaches to neural vector quantizers [10], [19], [27]. Thereby, close connections to information theoretic learning can be drawn for neural vector quantizers [3], [5], [16], [18], [26], [28]. Based on the fundamental work of Zador, distance based vector quantization can be related to magnification in prototype based vector quantization, which describes the relation between data and prototype density as a power law [45]. This relation can be used to design control strategies such that maximum mutual information between data and prototype density is obtained [1], [6], [39], [42]. However, this goal is achieved only as a side effect. It is not directly optimized by the learning schemes, because distance based vector quantization methods originally minimize variants of the description error [45], which usually does not optimize any information theoretic criterion. The respective control strategies have to be installed in addition to the usual prototype adaptation scheme, which, however, may generate side effects contrary to the original goal of the vector quantizer (for instance, topographic mapping) [42].

Yet vector quantization that directly optimizes information theoretic criteria becomes more and more important [5], [28], [40]. Two basic principles are widely used: maximization of the mutual information and minimization of divergence measures [24], [26]. Both criteria are equivalent for uniformly distributed data. Thereby, several entropy and divergence measures exist. Among the earliest, the Shannon entropy and the Kullback–Leibler divergence paved the way for the other methods [21], [34]. One famous entropy class is the class of α-entropies Hα [29]. These entropies are generalizations of the Shannon entropy. Introduced by A. Rényi, they show interesting properties, which are of special interest for numerical computation as discussed later in the paper [46]. In particular, the quadratic H2-entropy plays a distinguished role in this respect. Other divergence measures can be obtained using concepts of functional norms and their mathematical properties. J. Principe and colleagues have shown that, based on the Cauchy–Schwarz inequality for the functional L2-norm, a divergence measure can be derived which, together with a consistently chosen Parzen estimator for the unknown data densities, yields a numerically well-behaved approach to information-optimum prototype based vector quantization [24].
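For orientation, the divergence referred to here, commonly called the Cauchy–Schwarz divergence between two densities f and g, can be written in the following standard form (our notation, not quoted from the paper's own equations); by the Cauchy–Schwarz inequality it is non-negative and vanishes exactly when f and g coincide:

D_{\mathrm{CS}}(f,g) = -\log\left( \frac{\int f(v)\, g(v)\, dv}{\sqrt{\int f^{2}(v)\, dv \, \int g^{2}(v)\, dv}} \right)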

In this contribution, we first extend this approach of information theoretic vector quantization such that it is applicable to more general functional norms, keeping the prototype based principle of vector quantization as basis. Thus a broader range of applications becomes possible. For example, data equipped with only a pairwise similarity measure become tractable through this more general view. Further, we allow the similarity measure or functional norm to depend on additional parameters. In this way we obtain greater flexibility through a free choice of the parameters. Moreover, we are then able to optimize the metric itself in parallel to the prototype distribution and, hence, the whole vector quantization model of the given data with respect to these metric parameters. Thus an information-processing-optimum metric can be achieved.

This strategy of task dependent metric adaptation is known in supervised learning vector quantization (LVQ) as relevance learning.

The main contribution of this paper is that we extend the original approach of unsupervised information theoretic vector quantization introduced by J. Principe and colleagues to a supervised learning scheme. Thus, we transfer the ideas from unsupervised information theoretic vector quantization to an information theoretic LVQ approach, which is a classification scheme or, equivalently, a supervised learning scheme. Thereby, we allow the classification information of both data and prototypes to be fuzzy, i.e. we do not assume a crisp class decision for the training data or the adapted prototypes. Finally, we end up with a prototype based fuzzy classifier, which is an improvement over standard LVQ approaches, which usually provide crisp decisions, are not able to handle fuzzy labels for the data, or are not based on information theoretic principles.

The paper is organized as follows: First we review the approach of unsupervised information theoretic vector quantization introduced by J. Principe and colleagues, but in the more general variant of arbitrary functional metrics in Hilbert spaces. Subsequently, we explain the new model for a supervised fuzzy classification scheme based on the unsupervised method and show how metric adaptation (relevance learning) can be integrated. Numerical experiments on artificial and real-world data demonstrate the abilities of the new classification system.

Section snippets

Information theoretic unsupervised vector quantization based on functional norms using the Hölder-inequality

In the following we briefly review the derivation of a numerically well-behaved divergence measure as proposed by J. Principe. It differs in some properties from the well-known Kullback–Leibler divergence. However, it also vanishes for identical probability densities and, therefore, it can be used in density matching optimization tasks like prototype based vector quantization.

Let us start with the Shannon entropy in differential form for a density function P(v), v ∈ ℝⁿ:

H(P) = -\int P(v)\, \log(P(v))\, dv

If the …
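As a rough illustration (not the paper's own code), the Cauchy–Schwarz divergence between two sample sets can be estimated with Gaussian Parzen windows via pairwise information potentials. The function names, the fixed kernel width sigma, and the toy data below are our own assumptions; the estimator uses the fact that convolving two Gaussian kernels of width sigma gives a Gaussian of width sqrt(2)*sigma.

import numpy as np

def gauss_kernel_matrix(X, Y, sigma):
    # Pairwise isotropic Gaussian kernel values G_sigma(x_i - y_j).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    d = X.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / norm

def cs_divergence(X, Y, sigma=0.5):
    # Parzen estimate of D_CS(p, q) = -log( <p,q> / sqrt(<p,p><q,q>) ).
    s = np.sqrt(2.0) * sigma
    v_xy = gauss_kernel_matrix(X, Y, s).mean()   # cross information potential
    v_xx = gauss_kernel_matrix(X, X, s).mean()   # information potential of p
    v_yy = gauss_kernel_matrix(Y, Y, s).mean()   # information potential of q
    return -np.log(v_xy / np.sqrt(v_xx * v_yy))

# toy usage: two Gaussian clouds in two dimensions
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))
Y = rng.normal(1.5, 1.0, size=(300, 2))
print(cs_divergence(X, Y))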

Prototype based classification using the Cauchy–Schwarz divergence D_CS

In the following we will extend the above approach to the task of prototype based classification. Prototype based classification is a very intuitive and robust method [43]. It includes the LVQ algorithms introduced by Kohonen [19]. However, LVQ does not follow the gradient of any cost function. The classification error is reduced only heuristically. For overlapping classes this heuristic causes instabilities [31]. Several modifications have been proposed to overcome this problem [31],

Metric adaptation—relevance learning

Up to now, we formulated the algorithm for general difference based distance measures ξ(v − w). Usually, the Euclidean distance is applied. However, it is possible to use more complicated difference based distance measures. For example, one can consider an arbitrary, parameterized distance measure ξ_λ with a parameter vector λ = (λ_1, …, λ_{N_λ}), λ_i ≥ 0 and Σ_i λ_i = 1. We assume that ξ_λ is continuously differentiable. An important example is the scaled (quadratic) Euclidean metric

\xi_{\lambda}(v, w) = \sum_{k=1}^{n} \lambda_k (v_k - w_k)^2

In this …
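A minimal sketch of this scaled metric and of one generic relevance update step is given below. The function names and the simple gradient-step-plus-renormalization rule are our own illustration and not the paper's derived learning rule; only the constraints λ_k ≥ 0 and Σ_k λ_k = 1 are taken from the text above.

import numpy as np

def scaled_sq_euclidean(v, w, lam):
    # xi_lambda(v, w) = sum_k lam_k * (v_k - w_k)^2
    return np.sum(lam * (v - w) ** 2)

def update_relevances(lam, grad, eta=0.01):
    # Heuristic relevance-learning step (illustrative only):
    # gradient step, clipping to lam_k >= 0, renormalization to sum 1.
    lam = np.maximum(lam - eta * grad, 0.0)
    return lam / lam.sum()

# usage with uniform initial relevances in three dimensions
lam = np.ones(3) / 3.0
v = np.array([1.0, 2.0, 0.5])
w = np.array([0.8, 1.5, 1.0])
print(scaled_sq_euclidean(v, w, lam))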

Artificial data and exemplary applications

In a first toy example we applied the LVQ-CSD using the quadratic Euclidean distance for ξ to classify data obtained from two overlapping two-dimensional Gaussian distributions, each of them defining a data class. The overall number of data points was N = 600, equally split into training and test data. We used 10 prototypes with randomly initialized positions and fuzzy labels.

One crucial point when using Parzen estimators is the adequate choice of the kernel size σ². Silverman's rule gives a rough estimate [35]
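For orientation, the common one-dimensional rule-of-thumb attributed to Silverman sets the kernel width from the sample standard deviation and the sample size, σ ≈ 1.06 · std · N^(-1/5). The dimension-wise application below is our own simplification, not necessarily the exact variant used in the paper.

import numpy as np

def silverman_bandwidth(X):
    # Rule-of-thumb kernel width per dimension: 1.06 * std * N^(-1/5).
    # A single sigma could be taken, e.g., as the mean over dimensions.
    n = X.shape[0]
    return 1.06 * X.std(axis=0, ddof=1) * n ** (-1.0 / 5.0)

rng = np.random.default_rng(1)
data = rng.normal(size=(300, 2))
print(silverman_bandwidth(data))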

Conclusion

Based on the information theoretic approach of unsupervised vector quantization by density matching using the Cauchy–Schwarz divergence, we developed a new supervised learning vector quantization algorithm, which is able to handle fuzzy labels for data as well as for prototypes. In first toy applications the algorithm shows valuable results. In a realistic medical application we have demonstrated the power and the abilities of the new classification scheme and outlined possible conclusions in the

References (46)

  • D. DeSieno, Adding a conscience to competitive learning, in: Proceedings of the ICNN’88, International Conference on...
  • D. Erdogmus, Information theoretic learning: Renyi's entropy and its application to adaptive systems training, Ph.D....
  • B. Hammer et al., Supervised neural gas with general similarity measure, Neural Process. Lett. (2005)
  • S. Haykin, Neural Networks—A Comprehensive Foundation (1994)
  • W. Hermann et al., Comparison of clinical types of Wilson's disease and glucose metabolism in extrapyramidal motor brain regions, J. Neurol. (2002)
  • W. Hermann et al., Correlation between automated writing movements and striatal dopaminergic innervation in patients with Wilson's disease, J. Neurol. (2002)
  • W. Hermann et al., Pyramidale Schädigung im Vergleich zur extrapyramidalmotorischen Beeinträchtigung bei Patienten mit Morbus Wilson, Klin. Neurophysiol. (2007)
  • W. Hermann et al., Computergestützte Analyse der Handschrift bei Patienten mit Morbus Wilson, Klin. Neurophysiol. (2007)
  • W. Hermann et al., Classification of fine-motoric disturbances in Wilson's disease using artificial neural networks, Acta Neurol. Scand. (2005)
  • A.K. Jain et al., Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
  • R. Jenssen, An information theoretic approach to machine learning, Ph.D. Thesis, Department of Physics, University of...
  • J.N. Kapur, Measures of Information and their Application (1994)
  • T. Kohonen, Self-Organizing Maps, Springer Series in Information Sciences, vol. 30, Springer, Berlin, Heidelberg, 1995,...

    Thomas Villmann is a senior researcher at the Medical Department, University of Leipzig, Germany and leads the Computational Intelligence group (http://www.unileipzig.de/∼compint/). He holds a Ph.D. and the venia legendi in Computer Science. His research areas comprise the theory of prototype-based vector quantization, neural networks and machine learning as well as respective applications in medical data analysis, bioinformatics and satellite remote sensing. Several research stays have taken him to Belgium, France, the Netherlands, and the USA. He is a founding member of the German chapter of the European Neural Network Society (GNNS).

    Barbara Hammer received her Ph.D. in Computer Science in 1995 and her venia legendi in Computer Science in 2003, both from the University of Osnabrueck, Germany. From 2000 to 2004, she was leader of the junior research group ‘Learning with Neural Methods on Structured Data’ at the University of Osnabrueck. In 2004, she became Professor for Theoretical Computer Science at Clausthal University of Technology, Germany. Several research stays have taken her to Italy, the UK, India, France, and the USA. Her areas of expertise include hybrid systems, self-organizing maps, clustering, recurrent networks and their applications in bioinformatics, industrial process monitoring, and cognitive science.

    Frank-Michael Schleif studied Computer Science at Leipzig University, graduating in 2002. He then became a Ph.D. student at Leipzig University. In 2003, he joined Bruker Biosciences and continued his Ph.D. studies at the Clausthal University of Technology. In 2006, he received a Ph.D. in Computer Science; his thesis was awarded best Computer Science Ph.D. thesis at TUC. He is currently a research scientist in the MetaSTEM project team at IZKF (Interdisciplinary Center for Clinical Research) and a member of the Computational Intelligence Group at the Medical Department of Leipzig University. His research activities focus on machine learning methods, statistical data analysis and algorithm development.

    Wieland Hermann is a neurologist and head of the neurology department at the Paracelsus Hospital, Zwickau. He holds a doctoral degree and the venia legendi, both from Leipzig University. His research areas include extrapyramidal symptoms, Wilson's disease, and related topics.

    Marie Cottrell was born in Béthune, France in 1943. She was a student at the Ecole Normale Supérieure de Sèvres, and received the Agrégation de Mathématiques degree in 1964 (ranked 8th), and the Thèse d’Etat (Modélisations de réseaux de neurones par des chaînes de Markov et autres applications) in 1988. From 1964 to 1967, she was a high school teacher. From 1967 to 1988, she was successively an assistant and an assistant professor at the University of Paris and at the University of Paris-Sud (Orsay), except from 1970 to 1973, during which she was a professor at the University of Havana, Cuba. Since 1989, she has been a full professor at the University Paris 1-Panthéon-Sorbonne. Her research interests include stochastic algorithms, large deviation theory, biomathematics, data analysis, and statistics. Since 1986, her main work has dealt with artificial and biological neural networks, Kohonen maps and their applications in data analysis. She is the author of about 70 publications in this field. She is in charge of a research group at the University Paris 1 (the SAMOS). She is regularly solicited as a referee or international conference program committee member.
