Kernel-based linear classification on categorical data

Chen, Lifei; Ye, Yanfang; Guo, Gongde; Zhu, Jianping

doi:10.1007/s00500-015-1926-8

Focus
Published: 05 November 2015

Volume 20, pages 2981–2993 (2016)

Soft Computing

Abstract

Kernel-based methods have been widely investigated in the soft-computing community, but they focus mainly on numeric data. In this paper, we propose a novel method for kernel learning on categorical data and show how it can be used to derive effective linear classifiers. Based on kernel density estimation for categorical attributes, three popular classification methods, namely Naive Bayes, nearest neighbor, and prototype-based classification, are extended to categorical data. We also propose two data-driven approaches to the bandwidth selection problem: one minimizes the mean squared error of the kernel estimate, and the other optimizes attribute weights. Theoretical analysis indicates that, as in the numeric case, kernel learning of categorical attributes can make the classes more separable, which results in outstanding performance of the new classifiers on various real-world data sets.
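To make the idea of kernel density estimation on categorical attributes concrete, the following is a minimal sketch of a kernel-smoothed Naive Bayes classifier. The abstract does not state which kernel or bandwidth rule the paper uses, so the sketch assumes the classical Aitchison-Aitken kernel and a single fixed bandwidth shared by all attributes, in place of the two data-driven selection approaches the abstract mentions. The function names (aa_kernel_density, fit_kernel_naive_bayes, predict) and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
import math


def aa_kernel_density(values, categories, bandwidth):
    """Estimate P(X = c) for every category c with the Aitchison-Aitken kernel.

    The kernel gives weight (1 - bandwidth) to a matching sample and
    bandwidth / (c - 1) to a non-matching one, so averaging over the sample
    shrinks the observed frequencies toward the uniform distribution.
    """
    n = len(values)
    c = len(categories)
    if c < 2 or not 0.0 <= bandwidth <= (c - 1) / c:
        raise ValueError("need c >= 2 and bandwidth in [0, (c - 1) / c]")
    counts = Counter(values)
    density = {}
    for cat in categories:
        freq = counts[cat] / n
        density[cat] = (1.0 - bandwidth) * freq + bandwidth / (c - 1) * (1.0 - freq)
    return density


def fit_kernel_naive_bayes(rows, labels, domains, bandwidth):
    """Fit class priors and per-attribute kernel densities for each class.

    Assumption: one fixed bandwidth > 0 for all attributes, so every
    category keeps a strictly positive probability.
    """
    n = len(rows)
    model = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        densities = [
            aa_kernel_density([r[d] for r in members], domains[d], bandwidth)
            for d in range(len(domains))
        ]
        model[label] = (len(members) / n, densities)
    return model


def predict(model, row):
    """Return the class with the largest log-posterior score."""
    def log_score(label):
        prior, densities = model[label]
        return math.log(prior) + sum(
            math.log(dens[value]) for dens, value in zip(densities, row)
        )
    return max(model, key=log_score)


# Toy data: two categorical attributes, two classes (illustrative only).
domains = [("red", "green", "blue"), ("small", "large")]
rows = [
    ("red", "small"), ("red", "large"), ("green", "small"),
    ("blue", "large"), ("blue", "large"), ("green", "large"),
]
labels = ["a", "a", "a", "b", "b", "b"]

model = fit_kernel_naive_bayes(rows, labels, domains, bandwidth=0.2)
print(predict(model, ("red", "small")))
print(predict(model, ("blue", "large")))
```

With a positive bandwidth, a category that never occurs in a class (for example "blue" in class "a") still receives a small nonzero probability, which plays the role that Laplace smoothing plays in the usual frequency-based Naive Bayes. Setting the bandwidth to 0 recovers the raw frequencies, and the log-score would then fail on unseen categories.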

Acknowledgments

L. Chen and G. Guo’s work was supported by the National Natural Science Foundation of China under Grant No. 61175123, and the Fujian Normal University Innovative Research Team (IRTL1207). L. Chen’s work was also supported by the Natural Science Foundation of Fujian Province of China under Grant No. 2015J01238. J. Zhu’s work was supported by the National Social Science Foundation of China (Major Program 13&ZD148).

Author information

Authors and Affiliations

School of Mathematics and Computer Science, and Fujian Provincial Key Laboratory of Network Security and Cryptology, Fujian Normal University, Fuzhou, 350117, Fujian, China
Lifei Chen & Gongde Guo
Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, 26506, USA
Yanfang Ye
School of Management, Data-Mining Research Center, Xiamen University, Xiamen, 361005, China
Jianping Zhu

Corresponding author

Correspondence to Lifei Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by D. Neagu.

About this article

Cite this article

Chen, L., Ye, Y., Guo, G. et al. Kernel-based linear classification on categorical data. Soft Comput 20, 2981–2993 (2016). https://doi.org/10.1007/s00500-015-1926-8

Issue Date: August 2016
DOI: https://doi.org/10.1007/s00500-015-1926-8
