Abstract
Process mining is becoming an indispensable method for reconstructing workflow models, offering insights into mission-critical systems. Its efficacy depends on whether the underlying data mining algorithms can accurately classify or predict future events from process logs. However, exceptional events are scarce in most operational processes, so the process logs generated from these processes are highly imbalanced. A model learned from imbalanced data tends to over-generalize toward the normal classes and to be under-trained to recognize the rare classes. In this paper, we propose three methods to rectify this class imbalance problem, all founded upon a swarm intelligence metaheuristic. The first method, which is also the base of the other two, is the Dynamic Multi-objective Rebalancing Algorithm; its objective function considers both high accuracy and a high confidence level of classification, and it draws upon the particle swarm optimization algorithm. The other two methods are hybrids that combine the base method with over-sampling and under-sampling techniques. Experiments are conducted using the three proposed methods to rebalance the dataset, with classic resampling methods used for comparison. According to the results, the proposed methods perform satisfactorily compared with the other methods, and we extract meaningful decision rules from a rebalanced dataset in process mining.
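To illustrate why an objective that looks at accuracy alone can be misleading on imbalanced logs, the following minimal sketch compares accuracy with a chance-corrected agreement score. The abstract does not say how the confidence level is measured; Cohen's kappa is used here purely as an assumed stand-in, and the event labels, counts, and function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over labels of P(true = c) * P(pred = c).
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_chance == 1.0:
        return 0.0  # degenerate case: a single label everywhere
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical process log: 95 normal events and 5 rare events.
y_true = ["normal"] * 95 + ["rare"] * 5

# Model A (assumed): always predicts the majority class.
always_normal = ["normal"] * 100

# Model B (assumed): finds 4 of the 5 rare events, with 2 false alarms.
better = ["normal"] * 93 + ["rare"] * 2 + ["rare"] * 4 + ["normal"] * 1

acc_majority = accuracy(y_true, always_normal)
kappa_majority = cohen_kappa(y_true, always_normal)
acc_better = accuracy(y_true, better)
kappa_better = cohen_kappa(y_true, better)

print(f"majority-only: accuracy={acc_majority:.2f}, kappa={kappa_majority:.2f}")
print(f"rare-aware:    accuracy={acc_better:.2f}, kappa={kappa_better:.2f}")
```

Under these assumed counts the two models differ little in accuracy, yet only the second has a kappa above zero, which is the kind of gap a two-objective fitness function is meant to expose.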
Acknowledgement
The authors are grateful for the financial support from the research grant "Nature-Inspired Computing and Metaheuristics Algorithms for Optimizing Data Mining Performance" (Grant no. MYRG2016-00069-FST), offered by the University of Macau, FST, and RDAO. The authors appreciate the contribution that Dr. Jingyan Li made to this paper during his PhD studies at the University of Macau; he is now a data scientist at Huawei Technologies Co., Ltd., China.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Li, J., Wu, Y., Fong, S. et al. Dynamic swarm class rebalancing for the process mining of rare events. J Supercomput 77, 7549–7583 (2021). https://doi.org/10.1007/s11227-020-03500-x
DOI: https://doi.org/10.1007/s11227-020-03500-x