Abstract
Learning from categorical data is a critical yet challenging task. Current research focuses either on leveraging the complex interactions between and within categorical values to generate a numerical representation, or on designing models that can handle this type of data directly. However, both paradigms overlook the relation between the data characteristics and the hypothesis of the learning model. In this paper, we propose a model-aware categorical data embedding framework that jointly reveals the intrinsic characteristics of categorical data and optimizes the fitness of the representation for the follow-up learning model. An ELM-aware and an SVM-aware representation method are instantiated under this framework. Extensive classification experiments with the embedded representations on 17 data sets demonstrate that the proposed framework significantly improves categorical data representation performance compared with state-of-the-art competitors.
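To make the "model-aware" idea concrete, the sketch below selects between two generic categorical encodings by the cross-validated accuracy of the downstream classifier (an SVM here), so that the representation is judged by the fitness of the follow-up learner. This is only an illustrative stand-in under assumed toy data and standard scikit-learn encoders; it is not the joint embedding optimization proposed in the paper.

```python
# Minimal, illustrative sketch: pick the categorical encoding that best fits the
# downstream model. NOT the paper's method; data, encoders and the SVM choice are
# assumptions made purely for illustration.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Hypothetical toy categorical data set: two attributes, binary labels.
X = np.array([["red", "small"], ["red", "large"], ["blue", "small"],
              ["blue", "large"], ["green", "small"], ["green", "large"]] * 5)
y = np.array([0, 0, 1, 1, 1, 0] * 5)

candidate_encoders = {
    "one-hot": OneHotEncoder(handle_unknown="ignore"),
    "ordinal": OrdinalEncoder(),
}

scores = {}
for name, encoder in candidate_encoders.items():
    # The downstream model (SVM) scores each candidate numerical representation.
    pipeline = make_pipeline(encoder, SVC(kernel="rbf", gamma="scale"))
    scores[name] = cross_val_score(pipeline, X, y, cv=5).mean()

best = max(scores, key=scores.get)
print(f"model-aware choice: {best} encoding, scores = {scores}")
```

In this toy setting the "model awareness" is reduced to a simple selection step, whereas the framework in the paper embeds the categorical values themselves with the downstream model's hypothesis in the loop.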
Notes
The meaning of the symbol styles used in this paper is as follows: element: lowercase sans-serif font; value: lowercase; vector: lowercase bold font; matrix: uppercase bold font; set: uppercase; function: lowercase with parentheses; space: uppercase calligraphic font; value index: subscript; attribute index: superscript with parentheses.
The data sets can be downloaded from: http://archive.ics.uci.edu/ml; https://www.sgi.com/tech/mlc/db; https://www.kaggle.com.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants 61672528 and 61773392.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by X. Wang, A.K. Sangaiah, M. Pelillo.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhao, W., Li, Q., Zhu, C. et al. Model-aware categorical data embedding: a data-driven approach. Soft Comput 22, 3603–3619 (2018). https://doi.org/10.1007/s00500-018-3170-5