Abstract
Compositional data pose challenges to current classification methods because of their non-negativity and unit-sum constraints, especially when some of the components are zero. In this paper, we develop an effective classification method for multivariate compositional data in which some components equal zero. Specifically, a Kent feature embedding technique is first proposed to transform the compositional data and improve data quality. We then use a support vector machine, a state-of-the-art machine learning model, to build the classifier. Numerical simulations show that the proposed method is effective. Results on multiple real datasets, covering species classification, day-night image classification and household consumption pattern recognition, further verify that the proposed method achieves good classification performance and outperforms its competitors. The method should help to broaden the practical use of compositional data with zeros in classification tasks.
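As a rough illustration of why a representation on the hypersphere suits compositions with zeros, the sketch below applies the square-root transformation, which maps any composition onto the unit sphere and is well defined at zero. This is an assumption made for illustration only: the abstract does not specify the Kent feature embedding itself, and the function names `sqrt_transform` and `norm` and the example composition are hypothetical, not taken from the paper.

```python
import math


def sqrt_transform(composition, tol=1e-9):
    """Map a composition (non-negative parts summing to 1) onto the unit sphere.

    Zeros need no replacement here: sqrt(0) = 0 is well defined, unlike log(0).
    """
    if any(part < 0 for part in composition):
        raise ValueError("compositional parts must be non-negative")
    if abs(sum(composition) - 1.0) > tol:
        raise ValueError("compositional parts must sum to 1")
    return [math.sqrt(part) for part in composition]


def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vector))


# Hypothetical three-part composition with one zero component.
x = [0.5, 0.5, 0.0]
z = sqrt_transform(x)

# The squared entries of z are the original parts, so they sum to 1
# and z has unit length (up to floating-point rounding).
print(z, norm(z))
```

Vectors produced this way have unit length, so directional models such as the Kent distribution can in principle be fitted to them. The embedding and the support vector machine step described in the abstract are not reproduced here.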
Data availability statement
The data are publicly available, and details are given in the paper. The data can also be made available on reasonable request.
Notes
The pseudocode for the proposed Kent feature embedding is given in Algorithm 2, which is placed in the Appendix to keep the paper concise.
Funding
This study is funded by the National Natural Science Foundation of China (Nos. 72371257, 72001222, 71873012). RG is partially supported by the Humanities and Social Science General Program of the Ministry of Education of China (No. 23YJC910002). SL acknowledges support from the Jing Ying Scholar Support Program at the Central University of Finance and Economics (CUFE) and is a member of the Financial Sustainable Development Research Team at CUFE. SL, WW and RG also acknowledge support from the Program for Innovation Research, the “Double First-Class” Disciplinary Project and the Disciplinary Funding at CUFE.
Author information
Contributions
SL: Conceptualization; Methodology; Formal analysis; Writing—original draft; Writing—review & editing. WW: Formal analysis; Writing—review & editing. RG: Conceptualization; Methodology; Writing—review & editing.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Appendix
About this article
Cite this article
Lu, S., Wang, W. & Guan, R. Kent feature embedding for classification of compositional data with zeros. Stat Comput 34, 69 (2024). https://doi.org/10.1007/s11222-024-10382-z
DOI: https://doi.org/10.1007/s11222-024-10382-z