On the Design of Robust Linear Pattern Classifiers Based on \(M\)-Estimators

Neural Processing Letters

Abstract

Classical linear neural network architectures, such as the optimal linear associative memory (OLAM) (Kohonen and Ruohonen, IEEE Trans Comput 22(7):701–702, 1973) and the adaptive linear element (Adaline) (Widrow, IEEE Signal Process Mag 22(1):100–106, 2005; Widrow and Winter, IEEE Comput 21(3):25–39, 1988), are commonly used either as standalone pattern classifiers for linearly separable problems or as fundamental building blocks of multilayer nonlinear classifiers, such as the multilayer perceptron (MLP), the radial basis function network (RBFN), the extreme learning machine (ELM) (Huang et al., Int J Mach Learn Cybern 2:107–122, 2011) and the echo-state network (ESN) (Emmerich et al., Proceedings of the 20th international conference on artificial neural networks, pp 148–153, 2010). A common feature shared by the learning equations of the OLAM and the Adaline, respectively the ordinary least squares (OLS) and the least mean squares (LMS) algorithms, is that they are optimal only under the assumption of Gaussianity of the errors. However, the presence of outliers in the data causes the error distribution to depart from Gaussianity, and the classifier performance deteriorates accordingly. Bearing this in mind, in this paper we develop simple and efficient extensions of the OLAM and the Adaline, named Robust OLAM (ROLAM) and Robust Adaline (Radaline), which are robust to labeling errors (a.k.a. label noise), a type of outlier that often occurs in classification tasks. This type of outlier usually results from mistakes made while labeling the data points (e.g. the misjudgment of a specialist) or from typing errors made while creating the data files (e.g. by striking an incorrect key on the keyboard). To deal with such outliers, the ROLAM and the Radaline use \(M\)-estimators to compute the weights of the OLAM and Adaline networks, instead of the standard OLS/LMS algorithms. By means of comprehensive computer simulations using synthetic and real-world data sets, we show that the proposed robust linear classifiers consistently outperform their original versions.
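
As a concrete illustration of the approach summarized above, the sketch below fits a linear classifier with a Huber \(M\)-estimator via iteratively reweighted least squares (IRLS). This is only a minimal sketch of the general technique: the function names, the MAD-based scale estimate, the tuning constant \(k=1.345\) and the fixed number of IRLS iterations are illustrative assumptions, not the authors' exact ROLAM/Radaline formulation.

```python
import numpy as np

def huber_weights(residuals, k=1.345):
    """Huber M-estimator weights; k = 1.345 is a common (assumed) tuning constant."""
    # Robust scale estimate via the median absolute deviation (MAD).
    scale = 1.4826 * np.median(np.abs(residuals - np.median(residuals))) + 1e-12
    r = residuals / scale
    w = np.ones_like(r)
    large = np.abs(r) > k
    w[large] = k / np.abs(r[large])  # downweight samples with large residuals
    return w

def robust_linear_fit(X, d, n_iter=20):
    """Fit a linear model d ~ [1, X] @ beta by IRLS with Huber weights."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])  # bias column
    d = np.asarray(d, dtype=float)
    beta, *_ = np.linalg.lstsq(X, d, rcond=None)            # OLS initialization
    for _ in range(n_iter):
        w = huber_weights(d - X @ beta)                      # reweight by residuals
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], d * sw, rcond=None)
    return beta
```

With a \(\pm 1\) target encoding, a test pattern \(\mathbf{x}\) is then classified with the sign of \([1,\, \mathbf{x}^{T}]\,\boldsymbol{\beta}\); the Huber weights shrink the influence of patterns with large residuals, which is what mitigates the effect of mislabeled training points.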


Notes

  1. Also known as the delta learning rule or the Widrow–Hoff learning rule [34].

  2. The first component of \(\mathbf {x}_{n}\) is set equal to 1 in order to include the bias term.

  3. In other words, at iteration \(n\) or, equivalently, at the presentation of the \(n\)-th input pattern.

  4. The \(H_{\infty }\) criterion was originally introduced in the control theory literature as a means to ensure robust performance in the face of model uncertainties and of a lack of statistical information on the exogenous signals.

  5. www.mathworks.com (Matlab) and www.R-project.org (R).

  6. www.4shared.com/zip/HCELAcCLce/Robust_linear_classifiers.html.

  7. Spondylolisthesis is the displacement of a vertebra or the vertebral column in relation to the vertebrae below.

References

  1. Akusok A, Veganzones D, Miche Y, Severin E, Lendasse A (2014) Finding originally mislabels with MD-ELM. In: Proceedings of the 22nd European symposium on artificial neural networks, computational intelligence and machine learning (ESANN'2014), pp 689–694

  2. Alpaydin E, Jordan MI (1996) Local linear perceptrons for classification. IEEE Trans Neural Netw 7(3):788–792


  3. Anderson J (1972) A simple neural network generating an interactive memory. Math Biosci 14(3–4):197–220


  4. Ayad O (2014) Learning under concept drift with SVM. In: Proceedings of the 24th international conference on artificial neural networks (ICANN’2014), vol LNCS 8681, pp 587–594

  5. Bolzern P, Colaneri P, De Nicolao G (1999) H\(_\infty \)-robustness of adaptive filters against measurement noise and parameter drift. Automatica 35(9):1509–1520


  6. Chan SC, Zhou Y (2010) On the performance analysis of the least mean \({M}\)-estimate and normalized least mean \({M}\)-estimate algorithms with Gaussian inputs and additive Gaussian and contaminated Gaussian noises. J Signal Process Syst 80(1):81–103

  7. Chatterjee S, Hadi AS (1986) Influential observations, high leverage points, and outliers in linear regression. Stat Sci 1(3):379–393


  8. Cherkassky V, Fassett K, Vassilas N (1991) Linear algebra approach to neural associative memories and noise performance of neural classifiers. IEEE Trans Comput 40(12):1429–1435


  9. Dasgupta S, Kalai AT, Monteleoni C (2009) Analysis of perceptron-based active learning. J Mach Learn Res 10:281–299


  10. Duda RO, Hart PE, Stork DG (2006) Pattern classification, 2nd edn. Wiley, New York


  11. Eichmann G, Kasparis T (1989) Pattern classification using a linear associative memory. Pattern Recogn 22(6):733–740


  12. Emmerich C, Reinhart F, Steil J (2010) Recurrence enhances the spatial encoding of static inputs in reservoir networks. In: Proceedings of the 20th international conference on artificial neural networks, vol LNCS 6353, Springer, pp 148–153

  13. Fox J (2002) An R and S-PLUS companion to applied regression. Sage Publications, Thousand Oaks


  14. Frank A, Asuncion A (2010) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

  15. Frenay B, Verleysen M (2014) Classification in the presence of label noise: a survey. IEEE Trans Neural Netw Learn Syst 25(5):845–869


  16. Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Mach Learn 37(3):277–296


  17. Frieß T-T, Harrison RF (1999) A kernel-based Adaline for function approximation. Intell Data Anal 3(4):307–313


  18. Golub GH, van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore

  19. Hassibi B, Sayed AH, Kailath T (1994) H\(_\infty \) optimality criteria for LMS and backpropagation. In: Cowan JD, Tesauro G, Alspector J (eds) Advances in neural information processing systems 6. Morgan Kaufmann, San Mateo, pp 351–358

  20. Hassibi B, Sayed AH, Kailath T (1996) H\(_\infty \) optimality of the LMS algorithm. IEEE Trans Signal Process 44(2):267–280

  21. Haykin S (2008) Neural networks and learning machines, 3rd edn. Prentice-Hall, New Jersey


  22. Huang G-B, Wang DH, Lan Y (2011) Extreme learning machines: a survey. Int J Mach Learn Cybern 2:107–122


  23. Huber PJ (1964) Robust estimation of a location parameter. Ann Math Stat 35(1):73–101

  24. Huber PJ, Ronchetti EM (2009) Robust statistics, 2nd edn. Wiley, New York

  25. Hyvärinen A, Oja E (2000) Independent component analysis: algorithms and applications. Neural Netw 13(4–5):411–430


  26. Kavak A, Yigit H, Ertunc HM (2005) Using Adaline neural network for performance improvement of smart antennas in TDD wireless communications. IEEE Trans Neural Netw 16(6):1616–1625

  27. Kim H-C, Ghahramani Z (2008) Outlier robust Gaussian process classification. In: Proceedings of the 2008 joint IAPR international workshop on structural, syntactic, and statistical pattern recognition (SSPR'08), pp 896–905

  28. Kohonen T (1989) Self-organization and associative memory. Springer-Verlag, Berlin


  29. Kohonen T, Ruohonen M (1973) Representation of associated data by matrix operators. IEEE Trans Comput 22(7):701–702


  30. Liu W, Pokharel P, Principe J (2008) The kernel least-mean-square algorithm. IEEE Trans Signal Process 56(2):543–554


  31. Nakano K (1972) Associatron: a model of associative memory. IEEE Trans Syst Man Cybern SMC–2(3):380–388


  32. Oja E (1992) Principal components, minor components and linear neural networks. Neural Netw 5:927–935


  33. Poggio T, Girosi F (1990) Networks for approximation and learning. Proc IEEE 78(9):1481–1497


  34. Principe JC, Euliano NR, Lefebvre WC (2000) Neural and adaptive systems: fundamentals through simulations. Wiley, New York


  35. Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. Wiley, New York


  36. Stevens JP (1984) Outliers and influential data points in regression analysis. Psychol Bull 95(2):334–344


  37. Webb A (2002) Statistical pattern recognition, 2nd edn. Wiley, New York


  38. Widrow B (2005) Thinking about thinking: the discovery of the LMS algorithm. IEEE Signal Process Mag 22(1):100–106


  39. Widrow B, Kamenetsky M (2003) Statistical efficiency of adaptive algorithms. Neural Netw 16(5–6):735–744


  40. Widrow B, Winter R (1988) Neural nets for adaptive filtering and adaptive pattern recognition. IEEE Comput 21(3):25–39


  41. Williamson GA, Clarkson PM, Sethares WA (1993) Performance characteristics of the median LMS adaptive filter. IEEE Trans Signal Process 41(2):667–680


  42. Wu Y, Liu Y (2007) Robust truncated hinge loss support vector machines. J Am Stat Assoc 102(479):974–983

  43. Zhu X, Wu X (2004) Class noise versus attribute noise: a quantitative study. Artif Intell Rev 22(3):177–210


  44. Zou Y, Chan SC, Ng TS (2000) Least mean \(M\)-estimate algorithms for robust adaptive filtering in impulsive noise. IEEE Trans Circuits Syst II 47(12):1564–1569



Acknowledgments

The authors thank CNPq (Grant 309841/2012-7) for the financial support and NUTEC (Fundação Núcleo de Tecnologia Industrial do Ceará) for providing the laboratory infrastructure for the execution of the research activities reported in this paper. We also thank Mr. César Lincoln Mattos and José Daniel Santos for their kind help in generating the results for the KLMS classifier.

Author information


Corresponding author

Correspondence to Guilherme A. Barreto.

Appendix

By applying a nonlinear transformation to the input data, it is possible to obtain a nonlinear classifier from the same error function in Eq. (9). In a kernel context, the KLMS algorithm [30] operates on the feature space obtained by applying a mapping \(\Phi (\cdot )\) to the inputs, generating a new sequence of input-output pairs \(\{(\Phi (\varvec{x}_n), \mathbf {d}_n)\}_{n=1}^N\). The weight update is similar to the LMS rule shown in Eq. (10):

$$\begin{aligned} \hat{\varvec{\beta }}_{i,n+1} = \hat{\varvec{\beta }}_{i,n} + \eta e_{in}\Phi (\varvec{x}_n). \end{aligned}$$
(28)

Considering \(\hat{{\varvec{\beta }}}_{i,0} = \varvec{0}\), where \(\varvec{0}\) is the null-vector, after \(N\) iterations we get

$$\begin{aligned} \hat{{\varvec{\beta }}}_{i,N}&= \eta \sum _{n=1}^{N-1} e_{in} \Phi (\varvec{x}_n), \end{aligned}$$
(29)
$$\begin{aligned} \hat{y}_{i,N}&= \hat{{\varvec{\beta }}}_{i,N}^T\Phi (\varvec{x}_N) = \eta \sum _{n=1}^{N-1} e_{in} \kappa (\varvec{x}_n, \varvec{x}_N), \end{aligned}$$
(30)

where \(\kappa (\varvec{x}_n, \varvec{x}_N) = \Phi (\varvec{x}_n)^T\Phi (\varvec{x}_N)\) is a positive-definite kernel function. It should be noted that only Eq. (30) is needed, both for training and for testing. Although the weight vector itself never needs to be computed explicitly, the a priori errors \(e_{in}\), \(n \in \{1, \ldots , N\}\), and the training inputs \(\varvec{x}_n\), \(n \in \{1, \ldots , N\}\), must be stored for prediction purposes.
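
For illustration, the fragment below is a minimal sketch of the kernel expansion in Eqs. (28)–(30), assuming a Gaussian kernel; the kernel width \(\sigma\), the step size value and the function names are illustrative assumptions and not part of the original KLMS specification in [30].

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # kappa(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed width.
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * sigma ** 2))

def klms_train(X, d, eta=0.1, sigma=1.0):
    """Store the a priori errors e_n and the inputs x_n, as required by Eq. (30)."""
    centers, errors = [], []
    for x_n, d_n in zip(X, d):
        # Prediction with the kernel expansion accumulated so far (Eq. 30).
        y_n = eta * sum(e * gaussian_kernel(c, x_n, sigma)
                        for c, e in zip(centers, errors))
        errors.append(d_n - y_n)   # a priori error e_n
        centers.append(x_n)
    return centers, errors

def klms_predict(x, centers, errors, eta=0.1, sigma=1.0):
    # Eq. (30): y = eta * sum_n e_n * kappa(x_n, x)
    return eta * sum(e * gaussian_kernel(c, x, sigma)
                     for c, e in zip(centers, errors))
```

This makes explicit the point made above: the weight vector in the feature space is never formed, but the stored error/input pairs grow with the number of training presentations.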


Cite this article

Barreto, G.A., Barros, A.L.B.P. On the Design of Robust Linear Pattern Classifiers Based on \(M\)-Estimators. Neural Process Lett 42, 119–137 (2015). https://doi.org/10.1007/s11063-014-9393-2
