Abstract
Decision trees are classification models from which rules can be generated. Depending upon the type of decision tree model, rules may have from one to hundreds of conditions, with the same data attributes repeated across different conditional values, which makes the rules difficult to understand. To achieve more understandable rules, the number of nodes can be minimized to control the depth of the tree and, therefore, the number of conditions in the rules. Further, the study described in this paper seeks to optimize the decision tree for the generation of rules specific to the infrequent class, which presents another challenge because the infrequent class may have few instances in the dataset. Rules that are generated using either decision trees or class association mining generally come from the majority class of the dataset. In this study, the two mining techniques, decision trees and association mining, are used together through ensemble learning in an adaptable manner so that they expand and contract to accommodate the characteristics of the dataset. The ensemble learning occurs in phases, a partially generated (minimized) decision tree mining phase followed by an association mining phase, to increase the probability of finding infrequent class rules. The ensemble learning technique developed in this study is found to generate understandable rules with increased coverage and confidence for the infrequent class on both balanced and unbalanced datasets.
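The phased idea summarized above can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a one-level tree chosen by Gini impurity as a stand-in for the minimized decision tree phase, and a brute-force search over conjunctions of at most two conditions as a stand-in for the association mining phase. The toy dataset `DATA`, the attribute names, the thresholds `min_sup` and `min_conf`, and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: (attributes, class label); "drop" is the infrequent class.
DATA = [
    ({"credits": "few", "aid": "no", "campus": "off"}, "drop"),
    ({"credits": "few", "aid": "yes", "campus": "on"}, "drop"),
    ({"credits": "some", "aid": "no", "campus": "off"}, "drop"),
    ({"credits": "some", "aid": "no", "campus": "on"}, "stay"),
    ({"credits": "some", "aid": "yes", "campus": "off"}, "stay"),
    ({"credits": "some", "aid": "yes", "campus": "on"}, "stay"),
    ({"credits": "many", "aid": "no", "campus": "on"}, "stay"),
    ({"credits": "many", "aid": "yes", "campus": "off"}, "stay"),
    ({"credits": "many", "aid": "yes", "campus": "on"}, "stay"),
    ({"credits": "many", "aid": "yes", "campus": "on"}, "stay"),
]

def matches(conds, attrs):
    """True if the record satisfies every (attribute, value) condition."""
    return all(attrs.get(a) == v for a, v in conds)

def rule_stats(conds, data, target):
    """Support and confidence of the rule 'conds -> target' on data."""
    covered = [label for attrs, label in data if matches(conds, attrs)]
    hits = sum(1 for label in covered if label == target)
    support = hits / len(data) if data else 0.0
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_attribute(data):
    """Attribute whose one-level split has the lowest weighted Gini impurity."""
    def weighted_gini(attr):
        branches = {}
        for attrs, label in data:
            branches.setdefault(attrs[attr], []).append(label)
        return sum(len(b) / len(data) * gini(b) for b in branches.values())
    return min(sorted(data[0][0]), key=weighted_gini)

def mine(data, target, max_len, min_sup, min_conf):
    """Brute-force search for rules with at most max_len conditions."""
    items = sorted({(a, v) for attrs, _ in data for a, v in attrs.items()})
    rules = []
    for k in range(1, max_len + 1):
        for conds in combinations(items, k):
            if len({a for a, _ in conds}) < k:
                continue  # skip conjunctions that test one attribute twice
            sup, conf = rule_stats(conds, data, target)
            if sup >= min_sup and conf >= min_conf:
                rules.append((conds, sup, conf))
    return rules

def two_phase_rules(data, target, min_sup=0.1, min_conf=0.9):
    # Phase 1 (assumed stand-in for a minimized tree): one split, keep the
    # branches that are confident enough for the infrequent class.
    attr = best_split_attribute(data)
    phase1 = []
    for value in sorted({attrs[attr] for attrs, _ in data}):
        conds = ((attr, value),)
        sup, conf = rule_stats(conds, data, target)
        if sup >= min_sup and conf >= min_conf:
            phase1.append((conds, sup, conf))
    # Phase 2 (assumed stand-in for association mining): search the records
    # that no phase 1 rule covers; statistics are relative to that remainder.
    remaining = [(a, l) for a, l in data
                 if not any(matches(c, a) for c, _, _ in phase1)]
    phase2 = mine(remaining, target, 2, min_sup, min_conf)
    return phase1, phase2

phase1, phase2 = two_phase_rules(DATA, "drop")
for conds, sup, conf in phase1 + phase2:
    print(" AND ".join(f"{a}={v}" for a, v in conds), "-> drop", sup, conf)
```

On this toy data the first phase yields the single-condition rule `credits=few`, and the second phase recovers the remaining infrequent-class instance with the two-condition rule `aid=no AND campus=off`, which the one-level tree alone would miss. Capping the conjunction length at two conditions keeps the rules short, in the spirit of the understandability goal stated in the abstract.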
Ethics declarations
Conflict of interest
All the authors declare that they have no conflicts of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by S. Deb, T. Hanne, K. C. Wong.
Cite this article
Datta, S., Mengel, S. Adaptable multi-phase rules over the infrequent class. Soft Comput 22, 6067–6076 (2018). https://doi.org/10.1007/s00500-018-3399-z
DOI: https://doi.org/10.1007/s00500-018-3399-z