Multi-view heterogeneous fusion and embedding for categorical attributes on mixed data

Li, Qiude; Xiong, Qingyu; Ji, Shengfen; Gao, Min; Yu, Yang; Wu, Chao

doi:10.1007/s00500-019-04586-z

Methodologies and Application
Published: 04 December 2019

Volume 24, pages 10843–10863 (2020)
Soft Computing

Qiude Li^1,2,3, Qingyu Xiong^1,2 (ORCID: orcid.org/0000-0003-0976-2397), Shengfen Ji^4, Min Gao^1,2, Yang Yu^1,2 & Chao Wu^1,2

515 Accesses
7 Citations
Abstract

Categorical attributes are ubiquitous in real-world data. However, such attributes lack a well-defined distance metric and cannot be manipulated directly by algebraic operations, so many data mining algorithms cannot work on them directly. Learning an appropriate metric or an effective numerical embedding is vital yet challenging for categorical attributes with multi-view heterogeneous data characteristics. This paper proposes a novel multi-view heterogeneous fusion model (MVHF), which first captures basic coupling information for each view and then fuses this heterogeneous information from the different views by multi-kernel metric learning to measure the intrinsic distances for this type of categorical attribute. Based on these measured distances, a manifold learning method is then used to learn a high-quality numerical embedding for each categorical value. Experiments on 33 mixed data sets demonstrate that MVHF-enabled classification significantly improves performance compared with state-of-the-art distance metrics and embedding competitors.
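The three-stage pipeline described in the abstract (per-view distances, fusion, embedding) can be illustrated with a minimal sketch. This is not the authors' MVHF implementation. It rests on several assumptions: each view is a simple conditional-probability distance against one context column, the fusion uses fixed, hand-picked weights in place of multi-kernel metric learning, and classical multidimensional scaling stands in for the manifold learning step. The function names and the toy data are hypothetical.

```python
import numpy as np


def conditional_distance(values, context):
    """Per-view distance between categorical values: L1 distance between
    their conditional distributions P(context | value). Assumed stand-in
    for the paper's per-view coupling information."""
    vals = sorted(set(values))
    ctxs = sorted(set(context))
    probs = np.zeros((len(vals), len(ctxs)))
    for v, c in zip(values, context):
        probs[vals.index(v), ctxs.index(c)] += 1.0
    probs /= probs.sum(axis=1, keepdims=True)
    dist = np.abs(probs[:, None, :] - probs[None, :, :]).sum(axis=2)
    return vals, dist


def fuse(distances, weights):
    """Weighted average of per-view distance matrices. The weights are
    fixed here by assumption; the paper learns the fusion instead."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * d for w, d in zip(weights, distances))


def classical_mds(dist, dim):
    """Classical MDS: double-center the squared distances and keep the
    top eigenvectors, scaled by the square roots of their eigenvalues."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dim]
    top = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top)


# Toy data (hypothetical): embed the values of `color` using two views.
color = ["red", "red", "green", "green", "blue", "blue"]
label = ["pos", "pos", "pos", "neg", "neg", "neg"]
shape = ["round", "square", "round", "round", "square", "square"]

vals, d_label = conditional_distance(color, label)
_, d_shape = conditional_distance(color, shape)
fused = fuse([d_label, d_shape], weights=[0.5, 0.5])
embedding = classical_mds(fused, dim=2)  # one 2-D vector per categorical value
```

Each row of `embedding` is a numerical vector for the categorical value at the same position in `vals`, so a downstream classifier could replace the original column with these coordinates.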

Acknowledgements

We thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by the Key Research Program of Chongqing Science & Technology Commission (Grant Nos. CSTC2017jcyjBX0025 and CSTC2019jscx-zdztzx0043), the Science and Technology Major Special Project of Guangxi (Grant No. GKAA17129002), the National Natural Science Foundation of China (Grant No. 61771077), the National Key R&D Program of China (Grant No. 2018YFF0214706), and the Graduate Scientific Research and Innovation Foundation of Chongqing (Grant Nos. CYB19072 and CYS19028).

Author information

Authors and Affiliations

Key Laboratory of Dependable Service Computing in Cyber Physical Society, Chongqing University, Ministry of Education, Chongqing, China
Qiude Li, Qingyu Xiong, Min Gao, Yang Yu & Chao Wu
School of Big Data and Software Engineering, Chongqing University, Chongqing, 400044, China
Qiude Li, Qingyu Xiong, Min Gao, Yang Yu & Chao Wu
School of Biology and Engineering, Guizhou Medical University, Guiyang, 550004, Guizhou, China
Qiude Li
Foreign Language Teaching Center, Guizhou Institute of Technology, Guiyang, 550003, Guizhou, China
Shengfen Ji

Corresponding author

Correspondence to Qingyu Xiong.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, Q., Xiong, Q., Ji, S. et al. Multi-view heterogeneous fusion and embedding for categorical attributes on mixed data. Soft Comput 24, 10843–10863 (2020). https://doi.org/10.1007/s00500-019-04586-z

Published: 04 December 2019
Issue Date: July 2020
DOI: https://doi.org/10.1007/s00500-019-04586-z
