Abstract
Data reduction is crucial for turning large datasets into information, the major purpose of data science. The classic and rich area of dimensionality reduction (DR) has traditionally been based on feature extraction that combines primary features linearly, aiming to preserve the covariance or correlations between the features. Nonlinear alternatives have been developed, including information-theoretic approaches that use mutual information as well as conditional entropy based on target features. Here, we extend this approach to a feature selection or reduction strategy based on the concept of conditional Shannon entropy of two random variables. Novel results include (a) a dimensionality reduction method, in two variants, based on conditional entropy between the predictors themselves, disregarding the influence of the target feature; (b) an error-prevention method for DR with genomic data, inspired by error detection and correction in information theory, that can be used for abiotic data as well; and (c) a comparative assessment of the performance of several machine learning models on input features selected by these methods. We assess the quality of the techniques by their performance on three application problems of varying difficulty (Malware Classification, BioTaxonomy, and Noisy Classification), with competitive outcomes. Some useful heuristics arise from the analysis of the results, which also suggests some problems of interest for further research.
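To make the central quantity concrete, the conditional Shannon entropy of two discrete random variables can be estimated from samples through the identity H(Y | X) = H(X, Y) − H(X). The following is only a minimal sketch under stated assumptions: it uses plain empirical frequencies on already-discretized features, and the function names `entropy` and `conditional_entropy` are hypothetical, not taken from the authors' software. It illustrates the quantity itself, not the selection method of the paper.

```python
import numpy as np


def entropy(x):
    """Empirical Shannon entropy (in bits) of a 1-D array of discrete values."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def conditional_entropy(y, x):
    """Empirical H(Y | X) = H(X, Y) - H(X) for two discrete 1-D arrays.

    Assumption: x and y have equal length and hold discrete
    (e.g. integer-coded) values; continuous features would need binning first.
    """
    pairs = np.stack([np.asarray(x), np.asarray(y)], axis=1)
    # Count the distinct (x, y) rows to estimate the joint distribution.
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    h_xy = float(-(p * np.log2(p)).sum())
    return h_xy - entropy(x)
```

A value of H(Y | X) near zero means that X nearly determines Y, so one of the two predictors carries little additional information; how such values are turned into a selection rule is left to the paper itself.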






Data availability
All data for the Malware Classification (MC) and BioTaxonomy (BT) problems are publicly available. The synthetic dataset used in the Noisy Classification problem is available in the supplementary materials.
Code Availability Statement
The software used to run these applications is either part of the sklearn package or publicly available at bmc.memphis.edu/DSAx/, where 3D plots that allow a full comparison of the results on all models and datasets can also be found.
Acknowledgements
The use of HPC resources at the University of Memphis for processing datasets and training models is gratefully acknowledged. We are also grateful to the reviewers for valuable comments that resulted in substantial improvements to the quality and presentation of this work.
Funding
Not applicable.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Mainali, S., Garzon, M., Venugopal, D. et al. An Information-theoretic approach to dimensionality reduction in data science. Int J Data Sci Anal 12, 185–203 (2021). https://doi.org/10.1007/s41060-021-00272-2
DOI: https://doi.org/10.1007/s41060-021-00272-2