
Electrostatic field framework for supervised and semi-supervised learning from incomplete data

Published in: Natural Computing

Abstract

In this paper a classification framework for incomplete data, based on an electrostatic field model, is proposed. An original approach to exploiting incomplete training data with missing features, making extensive use of an electrostatic charge analogy, has been used. The framework supports a hybrid supervised and unsupervised training scenario, enabling simultaneous learning from both labelled and unlabelled data using the same set of rules and adaptation mechanisms. Classification of incomplete patterns has been facilitated by introducing a local dimensionality reduction technique, which aims to exploit all available information by using the data ‘as is’, rather than trying to estimate the missing values. The performance of all proposed methods has been extensively tested in a wide range of missing-data scenarios, using a number of standard benchmark datasets, in order to make the results comparable with those available in the current and future literature. Several modifications to the original Electrostatic Field Classifier, aimed at improving speed and robustness in higher-dimensional spaces, have also been introduced and discussed.
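To make the general idea concrete, the following is a minimal sketch (not the authors' implementation; function and variable names are illustrative) of a charge-analogy classifier: each labelled training sample acts as a point charge exerting an inverse-square pull on a query, and an incomplete query is handled by computing distances only in the subspace of its observed features, i.e. using the data 'as is' instead of imputing the missing values.

```python
import numpy as np

def classify_incomplete(x, X_train, y_train, eps=1e-9):
    """Classify a possibly incomplete sample x (NaN marks a missing
    feature) by the total Coulomb-like attraction towards each class.

    Illustrative sketch only: distances are computed in the subspace
    of features observed in x (a simple local dimensionality
    reduction), not by estimating the missing values.
    """
    observed = ~np.isnan(x)            # mask of available features
    x_obs = x[observed]
    forces = {}
    for label in np.unique(y_train):
        pts = X_train[y_train == label][:, observed]
        d2 = np.sum((pts - x_obs) ** 2, axis=1)    # squared distances
        forces[label] = np.sum(1.0 / (d2 + eps))   # inverse-square pull
    return max(forces, key=forces.get)             # strongest attraction

# Toy usage: two well-separated classes; the query is missing feature 2.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
print(classify_incomplete(np.array([0.05, np.nan]), X, y))  # → 0
```

The sketch deliberately omits the paper's adaptation mechanisms and the handling of unlabelled data; it only shows how the field analogy and the observed-feature subspace combine.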



Notes

  1. We use the term ‘sample’ to refer to a single object/instance, and not to the whole dataset (the common usage in the statistics literature).

  2. Deficiency level is the level of missingness of a dataset, with 0 for complete data and 1 for maximally incomplete data, taking into account the constraints given.
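Ignoring any dataset-specific constraints, the deficiency level defined above is simply the fraction of missing entries in the data matrix. A minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def deficiency_level(X):
    """Fraction of missing (NaN) entries in a data matrix:
    0.0 for complete data, 1.0 when every value is missing."""
    X = np.asarray(X, dtype=float)
    return float(np.isnan(X).sum() / X.size)

# Toy usage: one missing value out of four entries.
X = np.array([[1.0, np.nan],
              [2.0, 3.0]])
print(deficiency_level(X))  # → 0.25
```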


Author information


Corresponding author

Correspondence to Marcin Budka.


About this article

Cite this article

Budka, M., Gabrys, B. Electrostatic field framework for supervised and semi-supervised learning from incomplete data. Nat Comput 10, 921–945 (2011). https://doi.org/10.1007/s11047-010-9182-4
