Skip to main content
Log in

Kernel-based linear classification on categorical data

  • Focus
  • Published:
Soft Computing Aims and scope Submit manuscript

Abstract

Kernel-based methods have been widely investigated in the soft-computing community. However, they focus mainly on numeric data. In this paper, we propose a novel method for kernel learning on categorical data, and show how the method can be used to derive effective classifiers for linear classification. Based on kernel density estimation for categorical attributes, three popular classification methods, i.e., Naive Bayes, nearest neighbor and prototype-based classification, are effectively extended to classify categorical data. We also propose two data-driven approaches to the bandwidth selection problem, with one aimed at minimizing the mean squared error of the kernel estimate and the other endeavored to attribute weights optimization. Theoretical analysis indicates that, as in the numeric case, kernel learning of categorical attributes is capable to make the classes to be more separable, resulting in outstanding performances of the new classifiers on various real-world data sets.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

References

  • Aitchison J, Aitken C (1976) Multivariate binary discrimination by the kernel method. Biometrika 63:413–420

    Article  MathSciNet  MATH  Google Scholar 

  • Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of 8th SIAM international conference on data mining (SDM’08), pp 243–254

  • Buttrey SE (1998) Nearest-neighbor classification with categorical variables. Comput Stat Data Anal 28:157–169

    Article  MATH  Google Scholar 

  • Chen L (2015) A probabilistic framework for optimizing projected clusters with categorical attributes. Sci China Inf Sci 58:072104

  • Chen L, Guo G, Wang S, Kong X (2014) Kernel learning method for distance-based classification of categorical data. In: Proceedings of the 14th annual UK workshop on computational intelligence (UKCI’14), pp 58–63

  • Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27

    Article  MATH  Google Scholar 

  • Cristianini N, Scholkopf B (2002) Support vector machines and kernel methods: the new generation of learning machines. Artif Intell 23(3):31–41

    Google Scholar 

  • Duda R, Hart P, Stork D (2000) Pattern classification, 2nd edn. Wiley, New York

  • Guo G, Wang H, Bell D, Bi Y, Greer K (2006) Using kNN model for automatic text categorization. Soft Comput 10(5):423–430

    Article  Google Scholar 

  • Hall M, Frank E et al (2009) The WEKA data mining software: an update. SIGKDD Explor 11(1):10–18

  • Han E, Karypis G (2000) Centroid-based document classification: analysis & experimental results. In: Proceedings of the 4th European conference on principles and practice of knowledge discovery in databases (PKDD’00), pp 424–431

  • Hu Q, Yu D, Xie Z (2008) Neighborhood classifiers. Exp Syst Appl 34:876–886

    Article  Google Scholar 

  • Jiang L, Cai Z, Wang D, Zhang H (2014) Bayesian citation-KNN with distance weighting. Int J Mach Learn Cybern 5:193–199

    Article  Google Scholar 

  • John G, Langley P (1995) Estimating continuous distributions in bayesian classifiers. In: Proceedings of the conference on uncertainty in artificial intelligence (UAI’95), pp 338–345

  • Lewis D (1998) Naive (bayes) at forty: the independence assumption in information retrieval. In: Proceedings of 10th European conference on machine learning (ECML’98), pp 4–15

  • Li Q, Racine J (2007) Nonparametric econometrics: theory and practice. Princeton University Press, Princeton

  • Li Q, Racine J (2008) Nonparametric estimation of conditional CDF and quantile functions with mixed categorical and continuous data. J Bus Econ Stat 26(4):423–434

    Article  MathSciNet  Google Scholar 

  • Light RJ, Marglin BH (1971) An analysis of variance for categorical data. J Am Stat Assoc 66(335):534–544

    Article  MathSciNet  MATH  Google Scholar 

  • Murphy K (2012) Machine learning: a probabilistic perspective. The MIT Press, New York

  • Ouyang D, Li Q, Racine J (2006) Cross-validation and the estimation of probability distributions with categorical data. Nonparametric Stat 18(1):69–100

    Article  MathSciNet  MATH  Google Scholar 

  • Paredes R, Vidal E (2006) Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Trans Pattern Anal Mach Intell 28:1100–1110

    Article  Google Scholar 

  • Seeger M (2006) Bayesian modeling in machine learning: a tutorial review. Tutorial, Saarland University. http://lapmal.epfl.ch/papers/bayes-review

  • Sen PK (2005) Gini diversity index, hamming distance and curse of dimensionality. Metron Int J Stat LXIII (3):329–349

  • Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge

  • Vapnik V (1999) An overview of statistical learning theory. IEEE Trans Neural Netw 10(5):988–1000

    Article  Google Scholar 

  • Weinberger K, Saul L (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244

    MATH  Google Scholar 

  • Xiong T, Wang S, Mayers A, Monga E (2012) DHCC: divisive hierarchical clustering of categorical data. Data Min Knowl Discov 24(1):103–135

    Article  MathSciNet  MATH  Google Scholar 

  • Zhang J, Chen L, Guo G (2013) Projected-prototype-based classifier for text categorization. Knowl Based Syst 49:179–189

    Article  Google Scholar 

Download references

Acknowledgments

L. Chen and G. Guo’s work was supported by the National Natural Science Foundation of China under Grant No. 61175123, and the Fujian Normal University Innovative Research Team (IRTL1207). L. Chen’s work was also supported by the Natural Science Foundation of Fujian Province of China under Grant No. 2015J01238. J. Zhu’s work was supported by the National Social Science Foundation of China (Major Program 13&ZD148).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Lifei Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by D. Neagu.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Chen, L., Ye, Y., Guo, G. et al. Kernel-based linear classification on categorical data. Soft Comput 20, 2981–2993 (2016). https://doi.org/10.1007/s00500-015-1926-8

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00500-015-1926-8

Keywords

Navigation