Dynamic swarm class rebalancing for the process mining of rare events

The Journal of Supercomputing

Abstract

Process mining is becoming an indispensable method for reconstructing workflow models, offering insights into mission-critical systems. The efficacy of process mining depends on whether the underlying data mining algorithms can accurately classify or predict future events from process logs. However, exceptional events are scarce in most operational processes, so the process logs generated from these processes are highly imbalanced. A model learned from imbalanced data often tends to be overly generalized toward the normal classes but under-trained to recognize the rare classes. In this paper, we propose three methods to rectify this class imbalance problem, all founded upon a meta-heuristic swarm intelligence algorithm. The first method, which is also the base of the remaining two, is the Dynamic Multi-objective Rebalancing Algorithm; it considers both high accuracy and a high confidence level of classification in its objective function, and it draws upon the particle swarm optimization algorithm. The other two are hybrid methods that combine the base method with over-sampling and under-sampling techniques. Experiments are conducted using the three above-mentioned methods to rebalance the datasets, together with other classic resampling methods for comparison. According to the results, our proposed methods perform satisfactorily compared with the other methods, and we extracted meaningful decision rules from a rebalanced dataset in process mining.
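The idea in the abstract, a particle swarm searching for a rebalancing setting that scores well on both accuracy and classification confidence, can be illustrated with a minimal sketch. This is not the authors' algorithm. The toy one-dimensional data, the Gaussian naive Bayes classifier, the use of Cohen's Kappa as the confidence measure, the summed fitness `accuracy + kappa`, and all function names and PSO constants below are assumptions made only for illustration.

```python
import math
import random

def make_data(rng, n_major=300, n_minor=15):
    """Toy 1-D features: label 0 = normal event, label 1 = rare event."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_major)]
    data += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_minor)]
    return data

def fit_gnb(data, minority_weight):
    """Gaussian naive Bayes. Weighting the rare class by w changes only its
    prior, which is what duplicating every rare sample w times would do."""
    params = {}
    for label in (0, 1):
        xs = [x for x, y in data if y == label]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-9
        weight = len(xs) * (minority_weight if label == 1 else 1.0)
        params[label] = (mean, var, weight)
    return params

def predict(params, x):
    def log_score(label):
        mean, var, weight = params[label]
        return math.log(weight) - 0.5 * math.log(var) - (x - mean) ** 2 / (2 * var)
    return 1 if log_score(1) > log_score(0) else 0

def accuracy_and_kappa(params, data):
    tp = fp = tn = fn = 0
    for x, y in data:
        p = predict(params, x)
        if p == 1 and y == 1:
            tp += 1
        elif p == 1 and y == 0:
            fp += 1
        elif p == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    n = len(data)
    acc = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (acc - pe) / (1 - pe) if pe < 1 else 0.0
    return acc, kappa

def pso_rebalance(train, valid, rng, n_particles=8, n_iter=20, lo=1.0, hi=30.0):
    """Search the minority weight in [lo, hi] with a basic global-best PSO."""
    def fitness(w):
        acc, kappa = accuracy_and_kappa(fit_gnb(train, w), valid)
        return acc + kappa  # assumed scalarisation of the two objectives

    pos = [lo] + [rng.uniform(lo, hi) for _ in range(n_particles - 1)]
    vel = [0.0] * n_particles
    pbest = list(pos)
    pbest_fit = [fitness(w) for w in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (0.7 * vel[i]
                      + 1.5 * r1 * (pbest[i] - pos[i])
                      + 1.5 * r2 * (gbest - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))  # keep inside bounds
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i], f
    return gbest, gbest_fit

rng = random.Random(0)
train, valid = make_data(rng), make_data(rng)
best_w, best_fit = pso_rebalance(train, valid, rng)
base_acc, base_kappa = accuracy_and_kappa(fit_gnb(train, 1.0), valid)
print(f"baseline acc+kappa={base_acc + base_kappa:.3f}, "
      f"best weight={best_w:.2f}, best acc+kappa={best_fit:.3f}")
```

The first particle is placed at weight 1.0, the unrebalanced baseline, so the swarm's best fitness on the validation split can never fall below that baseline. A real study would replace the toy data with process logs and evaluate on data not used to choose the weight.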

Acknowledgement

The authors are grateful for the financial support from the research grant "Nature-Inspired Computing and Metaheuristics Algorithms for Optimizing Data Mining Performance" (Grant no. MYRG2016-00069-FST), offered by the University of Macau, FST, and RDAO. The authors appreciate the contribution that Dr. Jingyan Li made to this paper during his PhD studies at the University of Macau; he is now a data scientist at Huawei Technologies Co. Ltd., China.

Author information

Corresponding author

Correspondence to Yaoyang Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 2, 3, 4, 5, 6, 7, 8, 9, 10.

Table 2 Performances of Secom dataset with different rebalancing methods
Table 3 Statistics of the extracted rules from each method's rebalanced datasets
Table 4 Rules extracted from Secom dataset by SaCb-SDMORA-NB + JRIP
Table 5 Results of Kappa with different datasets and algorithms
Table 6 Results of Accuracy with different datasets and algorithms
Table 7 Results of BER with different datasets and algorithms
Table 8 Majority class and Minority class Variations of pre-/post-processed by different algorithms in different dataset
Table 9 Majority class and Minority class Variations of pre-/post-processed by different algorithms in different dataset
Table 10 Results of %Time with different datasets for the swarm intelligence-based algorithms
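Tables 5 to 7 report Kappa, Accuracy, and BER. The short sketch below shows how the three measures can be computed from a binary confusion matrix, and why accuracy alone can mislead on imbalanced data. It uses the standard textbook definitions of Cohen's Kappa and the balanced error rate, which are assumed here and may differ in detail from the paper's own formulas. The function name and the 95/5 example counts are hypothetical.

```python
def imbalance_metrics(tp, fp, tn, fn):
    """Return (accuracy, Cohen's kappa, balanced error rate) for a binary
    confusion matrix where the positive class is the rare class."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    # Mean of the per-class error rates, so each class counts equally.
    ber = 0.5 * (fn / (tp + fn) + fp / (tn + fp))
    return accuracy, kappa, ber

# A classifier that always predicts "normal" on 95 normal and 5 rare events.
acc, kappa, ber = imbalance_metrics(tp=0, fp=0, tn=95, fn=5)
print(acc, kappa, ber)
```

In this example the accuracy is 0.95, yet Kappa is 0 and BER is 0.5, because the classifier never detects a rare event. This is the kind of failure that Kappa and BER expose and accuracy hides.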

About this article

Cite this article

Li, J., Wu, Y., Fong, S. et al. Dynamic swarm class rebalancing for the process mining of rare events. J Supercomput 77, 7549–7583 (2021). https://doi.org/10.1007/s11227-020-03500-x

  • DOI: https://doi.org/10.1007/s11227-020-03500-x
