A novel discretization algorithm based on multi-scale and information entropy

Xun, Yaling; Yin, Qingxia; Zhang, Jifu; Yang, Haifeng; Cui, Xiaohui

doi:10.1007/s10489-020-01850-w

Published: 12 September 2020

Volume 51, pages 991–1009 (2021)
Applied Intelligence

Yaling Xun (ORCID: orcid.org/0000-0002-9590-6619)¹, Qingxia Yin¹, Jifu Zhang¹, Haifeng Yang¹ & Xiaohui Cui¹

Abstract

Discretization is a data preprocessing topic in data mining and is critical to improving the efficiency and quality of mining. Multi-scale analysis can reveal the structure and hierarchical characteristics of data objects: a reasonable hierarchical division of a research object yields representations of the data at different granularities. This paper introduces multi-scale theory into the process of data discretization and proposes MSE, a data discretization method based on multi-scale and information entropy. MSE first performs scale partition on the attribute domain to obtain candidate cut point sets at different granularities. Information entropy is then applied to the candidate cut point set: the candidate with the minimum information entropy is selected and tested in turn against the MDLPC criterion to determine the final cut point set. In this way, MSE avoids a limitation of traditional discretization methods, which consider only the statistical attribute values and therefore restrict the candidate cut points to a limited set of attribute values. It also reduces the number of candidates by controlling the data division hierarchy to an optimal range. Extensive experiments show that MSE achieves high performance in terms of discretization efficiency and classification accuracy, especially when it is applied to support vector machines, random forests, and decision trees.
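The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the scale partition is built, so `candidate_cuts` assumes a hypothetical dyadic equal-width grid (level k splits the domain into 2^k cells), and all function names are illustrative. The acceptance test follows the standard MDLPC criterion of Fayyad and Irani (1993).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def candidate_cuts(lo, hi, levels):
    """Assumed scale partition: level k splits [lo, hi] into 2**k equal cells."""
    cuts = set()
    for k in range(1, levels + 1):
        parts = 2 ** k
        cuts.update(lo + (hi - lo) * i / parts for i in range(1, parts))
    return sorted(cuts)


def mdlpc_accepts(labels, left, right):
    """MDLPC test: accept the cut only if the gain exceeds the MDL cost."""
    n = len(labels)
    e, e1, e2 = entropy(labels), entropy(left), entropy(right)
    gain = e - (len(left) / n) * e1 - (len(right) / n) * e2
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * e - k1 * e1 - k2 * e2)
    return gain > (math.log2(n - 1) + delta) / n


def discretize(values, labels, cuts):
    """Pick the minimum-entropy candidate, keep it if MDLPC accepts, then recurse."""
    pairs = list(zip(values, labels))
    best = None
    for c in cuts:
        left = [y for x, y in pairs if x <= c]
        right = [y for x, y in pairs if x > c]
        if not left or not right:
            continue  # a cut outside the observed range separates nothing
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, c, left, right)
    if best is None or not mdlpc_accepts(labels, best[2], best[3]):
        return []
    c = best[1]
    lo_part = [(x, y) for x, y in pairs if x <= c]
    hi_part = [(x, y) for x, y in pairs if x > c]
    return (
        discretize([x for x, _ in lo_part], [y for _, y in lo_part], [t for t in cuts if t < c])
        + [c]
        + discretize([x for x, _ in hi_part], [y for _, y in hi_part], [t for t in cuts if t > c])
    )
```

As a usage example, for twenty values 0.5, 1.5, ..., 19.5 labelled `"a"` below 10 and `"b"` above, `discretize(values, labels, candidate_cuts(0.0, 20.0, 3))` returns `[10.0]`: the cut at 10 separates the classes perfectly, and MDLPC rejects every further split of the two pure intervals. Note that 10.0 is not itself an observed attribute value, which is the kind of candidate the abstract says traditional methods cannot propose.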

Funding

This work is supported by the National Natural Science Foundation of P. R. China (Nos. 61602335, 61876122), the Natural Science Foundation of Shanxi Province, P. R. China (No. 201901D211302), the Taiyuan University of Science and Technology Scientific Research Initial Funding of Shanxi Province, P. R. China (No. 20172017), and the Scientific and Technological Innovation Team of Shanxi Province, P. R. China (No. 201805D131007).

Author information

Authors and Affiliations

Taiyuan University of Science and Technology (TYUST), Taiyuan, Shanxi, 030024, China
Yaling Xun, Qingxia Yin, Jifu Zhang, Haifeng Yang & Xiaohui Cui

Corresponding author

Correspondence to Yaling Xun.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xun, Y., Yin, Q., Zhang, J. et al. A novel discretization algorithm based on multi-scale and information entropy. Appl Intell 51, 991–1009 (2021). https://doi.org/10.1007/s10489-020-01850-w

Published: 12 September 2020
Issue Date: February 2021
DOI: https://doi.org/10.1007/s10489-020-01850-w
