Adaptable multi-phase rules over the infrequent class

  • Focus
Soft Computing

Abstract

Decision trees are classification models that allow rule generation. Depending on the type of decision tree model, rules may have from one to hundreds of conditions, and the same data attribute may recur with different conditional values, which makes the rules difficult to understand. To achieve more understandable rules, the number of nodes can be minimized to control the depth of the tree and, therefore, the number of conditions in the rules. Further, the study described in this paper seeks to optimize the decision tree for the generation of rules specific to the infrequent class, which presents another challenge because the infrequent class may have few instances in the dataset. Rules generated using either decision trees or class association mining generally come from the majority class of the dataset. In this study, the two mining techniques are used together through ensemble learning in an adaptable manner, so that they expand and contract to accommodate the characteristics of the dataset. The ensemble learning occurs in phases, a partially generated or minimized decision tree mining phase and an association mining phase, to increase the probability of finding infrequent class rules. The ensemble learning technique developed in this study is found to generate understandable rules with increased coverage and confidence for the infrequent class on balanced or unbalanced datasets.
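
As a rough illustration of the two rule-quality measures named in the abstract, the sketch below scores a candidate rule against an infrequent class. It is not the authors' implementation. The definitions are assumptions made for this example: coverage is taken as the share of infrequent-class instances that the rule matches, and confidence as the share of matched instances that belong to the infrequent class. The function `rule_metrics`, the attributes `a` and `b`, and the toy data are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]


def rule_metrics(
    records: List[Record],
    labels: List[str],
    rule: Callable[[Record], bool],
    target: str,
) -> Tuple[float, float]:
    """Return (coverage, confidence) of `rule` for class `target`.

    Assumed definitions (not taken from the paper):
      coverage   = matched target instances / all target instances
      confidence = matched target instances / all matched instances
    """
    matched = [lab for rec, lab in zip(records, labels) if rule(rec)]
    hits = sum(1 for lab in matched if lab == target)
    n_target = sum(1 for lab in labels if lab == target)
    coverage = hits / n_target if n_target else 0.0
    confidence = hits / len(matched) if matched else 0.0
    return coverage, confidence


# Hypothetical toy dataset: 8 instances, only 2 in the infrequent class "rare".
records = [
    {"a": 1.8, "b": 9},
    {"a": 2.1, "b": 12},
    {"a": 3.4, "b": 15},
    {"a": 3.0, "b": 12},
    {"a": 1.9, "b": 15},
    {"a": 3.7, "b": 18},
    {"a": 2.8, "b": 9},
    {"a": 3.2, "b": 15},
]
labels = ["rare", "rare", "common", "common", "common", "common", "common", "common"]


def one_condition(rec: Record) -> bool:
    # Short rule: a single condition on attribute "a".
    return rec["a"] < 2.5


def two_conditions(rec: Record) -> bool:
    # Slightly longer rule: adds a condition on attribute "b".
    return rec["a"] < 2.5 and rec["b"] <= 12


for name, rule in [("one_condition", one_condition), ("two_conditions", two_conditions)]:
    cov, conf = rule_metrics(records, labels, rule, "rare")
    print(f"{name}: coverage={cov:.2f} confidence={conf:.2f}")
```

On this toy data both rules match every infrequent-class instance, but the one-condition rule also matches one majority-class instance, so its confidence is lower. That is the kind of trade-off between rule length and rule quality that the abstract describes.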

Author information

Corresponding author

Correspondence to Soma Datta.

Ethics declarations

Conflict of interest

All the authors declare that they have no conflicts of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by S. Deb, T. Hanne, K. C. Wong.

About this article

Cite this article

Datta, S., Mengel, S. Adaptable multi-phase rules over the infrequent class. Soft Comput 22, 6067–6076 (2018). https://doi.org/10.1007/s00500-018-3399-z

  • DOI: https://doi.org/10.1007/s00500-018-3399-z
