Abstract
Data discretization is an important task for certain data mining algorithms, such as association rule discovery and Bayesian learning. For these algorithms, proper discretization can significantly improve the quality and understandability of the discovered knowledge and can also reduce running time. We present a Global Unsupervised Discretization Algorithm based on the Collective Correlation Coefficient (GUDA-CCC), which offers two merits: 1) it does not require class labels in the training data; 2) it preserves the rank order of attribute importance in a data set while minimizing the information loss, measured by mean square error. Attribute importance is measured by the CCC, which is derived from principal component analysis (PCA). The idea behind GUDA-CCC is that staying close to the original data set may be the best policy, especially when other available information is not reliable enough to guide the discretization. Experiments on benchmark data sets illustrate the effectiveness of GUDA-CCC.
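To make the notion of information loss measured by mean square error concrete, the following is a minimal sketch, not the GUDA-CCC algorithm itself. It assumes a single continuous attribute and uses one-dimensional Lloyd quantization as a stand-in for an MSE-reducing discretizer. The CCC-based rank preservation is not shown, because the abstract does not define the CCC. The function names `lloyd_1d`, `discretize`, and `mse` are hypothetical.

```python
from statistics import mean


def lloyd_1d(values, k, iters=100):
    """Illustrative 1-D Lloyd quantization (an assumption, not GUDA-CCC).

    Returns k representative levels, one per discrete interval.
    """
    lo, hi = min(values), max(values)
    # Start from the midpoints of k equal-width bins.
    levels = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in values:
            nearest = min(range(k), key=lambda i: abs(x - levels[i]))
            groups[nearest].append(x)
        # Move each level to the mean of its group; keep empty groups fixed.
        new_levels = [mean(g) if g else levels[i] for i, g in enumerate(groups)]
        if new_levels == levels:
            break
        levels = new_levels
    return levels


def discretize(values, levels):
    """Map each value to the index of its nearest level."""
    return [min(range(len(levels)), key=lambda i: abs(x - levels[i]))
            for x in values]


def mse(values, levels):
    """Mean square error between values and their discretized levels."""
    codes = discretize(values, levels)
    return mean((x - levels[c]) ** 2 for x, c in zip(values, codes))


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
levels = lloyd_1d(data, k=2)
codes = discretize(data, levels)
loss = mse(data, levels)
```

Here `codes` holds the discrete value assigned to each sample and `loss` is the resulting information loss. Comparing `loss` against the loss of a fixed equal-width scheme on the same data is one simple way to see how much an MSE-aware discretizer preserves of the original values.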
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zeng, A., Gao, QG., Pan, D. (2011). A Global Unsupervised Data Discretization Algorithm Based on Collective Correlation Coefficient. In: Mehrotra, K.G., Mohan, C.K., Oh, J.C., Varshney, P.K., Ali, M. (eds) Modern Approaches in Applied Intelligence. IEA/AIE 2011. Lecture Notes in Computer Science, vol 6703. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21822-4_16
DOI: https://doi.org/10.1007/978-3-642-21822-4_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21821-7
Online ISBN: 978-3-642-21822-4
eBook Packages: Computer Science (R0)