Abstract
Discretization is a data preprocessing task in data mining and is critical to improving the efficiency and quality of mining. Multi-scale analysis can reveal the structure and hierarchical characteristics of data objects: if a research object is divided hierarchically in a reasonable way, representations of the data at different granularities are obtained. This paper introduces multi-scale theory into the process of data discretization and proposes MSE, a data discretization method based on multi-scale and information entropy. MSE first conducts scale partition on the domain attribute to obtain candidate cut point sets of different granularities. Information entropy is then applied to the candidate cut point set: the candidate cut point with the minimum information entropy is selected and tested in turn against the MDLPC criterion to determine the final cut point set. In this way, MSE avoids a limitation of traditional discretization methods, which consider only the statistical attribute values and therefore restrict the candidate cut points to certain attribute values, and it reduces the number of candidates by keeping the data division hierarchy within an optimal range. Extensive experiments show that MSE achieves high performance in terms of discretization efficiency and classification accuracy, especially when it is applied to support vector machines, random forests, and decision trees.
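The pipeline described in the abstract (scale partition, minimum-entropy selection, MDLPC test) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the scale partition is done, so the dyadic equal-width grid in `multiscale_candidates` is purely an assumption, and all function names are hypothetical. The acceptance test follows the standard Fayyad and Irani MDLPC formula, applied recursively to each accepted interval.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def multiscale_candidates(lo, hi, levels):
    """ASSUMED scale partition: level k splits [lo, hi] into 2**k equal cells.

    The cut points of all levels are pooled, so candidates are not tied to
    the attribute values that happen to occur in the data.
    """
    cuts = set()
    for k in range(1, levels + 1):
        cells = 2 ** k
        for i in range(1, cells):
            cuts.add(lo + (hi - lo) * i / cells)
    return sorted(cuts)


def split(pairs, cut):
    """Split (value, label) pairs at `cut` and return the two label lists."""
    left = [y for x, y in pairs if x <= cut]
    right = [y for x, y in pairs if x > cut]
    return left, right


def mdlpc_accepts(labels, left, right):
    """Fayyad-Irani MDLPC test: accept the cut only if the gain pays for it."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    h, h1, h2 = entropy(labels), entropy(left), entropy(right)
    gain = h - (len(left) / n) * h1 - (len(right) / n) * h2
    delta = math.log2(3 ** k - 2) - (k * h - k1 * h1 - k2 * h2)
    return gain > (math.log2(n - 1) + delta) / n


def discretize(values, labels, candidates):
    """Pick the minimum-entropy candidate, test it with MDLPC, then recurse."""
    pairs = list(zip(values, labels))
    best = None
    for cut in candidates:
        left, right = split(pairs, cut)
        if not left or not right:
            continue  # this candidate does not divide the current interval
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or e < best[0]:
            best = (e, cut, left, right)
    if best is None or not mdlpc_accepts(list(labels), best[2], best[3]):
        return []
    cut = best[1]
    lo_part = [(x, y) for x, y in pairs if x <= cut]
    hi_part = [(x, y) for x, y in pairs if x > cut]
    lo_cuts = discretize([x for x, _ in lo_part], [y for _, y in lo_part], candidates)
    hi_cuts = discretize([x for x, _ in hi_part], [y for _, y in hi_part], candidates)
    return lo_cuts + [cut] + hi_cuts


if __name__ == "__main__":
    values = [1, 2, 3, 5, 6, 7, 9, 10, 11, 13, 14, 15]
    labels = ["a"] * 6 + ["b"] * 6
    print(discretize(values, labels, multiscale_candidates(0, 16, 2)))
```

In this toy run the candidates come from the grid over the range [0, 16] rather than from the observed values, which mirrors the abstract's point that candidate cut points need not coincide with attribute values; the `levels` parameter stands in for the control of the division hierarchy.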
Funding
This work is supported by the National Natural Science Foundation of P. R. China (No. 61602335, 61876122), Natural Science Foundation of Shanxi Province, P. R. China (No. 201901D211302), Taiyuan University of Science and Technology Scientific Research Initial Funding of Shanxi Province, P. R. China (No. 20172017), and Scientific and Technological Innovation Team of Shanxi Province, P. R. China (No. 201805D131007).
Ethics declarations
Conflict of interests
The authors declare that they have no conflict of interest.
Additional information
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Xun, Y., Yin, Q., Zhang, J. et al. A novel discretization algorithm based on multi-scale and information entropy. Appl Intell 51, 991–1009 (2021). https://doi.org/10.1007/s10489-020-01850-w
DOI: https://doi.org/10.1007/s10489-020-01850-w