Abstract
Decision trees are known to perform very well in classification tasks in data mining, and sampling is often used to construct suitable training sets. Among the many factors involved, the accuracy of a generated decision tree depends strongly on the training data set, so we seek better classification models from the given data sets by oversampling instances of the classes that have higher error rates. The resulting decision trees achieve better accuracy for the classes that had lower error rates, but worse accuracy for the classes with higher error rates. To take advantage of the improved accuracy while compensating for the degraded accuracy, we suggest using class association rules. Experiments with real-world data sets showed promising results.
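The oversampling step described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it assumes per-class error rates are estimated from a first classifier's predictions, and then duplicates instances of the classes whose error rate is above average (the `factor` parameter and the above-average threshold are hypothetical choices for this sketch).

```python
from collections import defaultdict

def oversample_high_error(instances, labels, predictions, factor=2):
    """Duplicate training instances of classes with above-average error
    rates (a hypothetical helper; the paper's exact scheme may differ)."""
    counts = defaultdict(int)
    errors = defaultdict(int)
    # Estimate a per-class error rate from a first classifier's predictions.
    for y, p in zip(labels, predictions):
        counts[y] += 1
        if y != p:
            errors[y] += 1
    rates = {c: errors[c] / counts[c] for c in counts}
    avg = sum(rates.values()) / len(rates)
    # Start from the original training set and append extra copies of
    # instances whose class is "inferior" (error rate above average).
    out_x, out_y = list(instances), list(labels)
    for x, y in zip(instances, labels):
        if rates[y] > avg:
            out_x.extend([x] * (factor - 1))
            out_y.extend([y] * (factor - 1))
    return out_x, out_y
```

A second decision tree trained on the returned set would then see the high-error classes more often; the paper's complementary class association rules are not shown here.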
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Sug, H. (2011). An Effective Method to Find Better Data Mining Model Using Inferior Class Oversampling. In: Lee, G., Howard, D., Ślęzak, D. (eds) Convergence and Hybrid Information Technology. ICHIT 2011. Communications in Computer and Information Science, vol 206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24106-2_73
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24105-5
Online ISBN: 978-3-642-24106-2