Dynamic swarm class rebalancing for the process mining of rare events

Li, Jinyan; Wu, Yaoyang; Fong, Simon; Wong, Raymond K.; Chu, Victor W.; Ong, Kok-leong; Wong, Kelvin K. L.

doi:10.1007/s11227-020-03500-x

Published: 05 January 2021

The Journal of Supercomputing, Volume 77, pages 7549–7583 (2021)

Jinyan Li¹,
Yaoyang Wu¹˒² (ORCID: orcid.org/0000-0003-2018-6730),
Simon Fong¹,
Raymond K. Wong³,
Victor W. Chu⁴,
Kok-leong Ong⁵ &
…
Kelvin K. L. Wong⁶

Abstract

Process mining is becoming an indispensable method for reconstructing workflow models, offering insights into mission-critical systems. Its efficacy depends on whether the underlying data mining algorithms can accurately classify or predict future events from process logs. However, exceptional events are scarce in most operational processes, so the process logs they generate are highly imbalanced. A model learned from imbalanced data often generalizes too strongly toward the normal classes and is under-trained to recognize the rare classes. In this paper, we propose three methods to rectify this class imbalance problem, all founded upon a meta-heuristic swarm intelligence algorithm. The first method, which is also the base of the other two, is the Dynamic Multi-objective Rebalancing Algorithm; it draws upon the particle swarm optimization algorithm and considers both high accuracy and a high confidence level of classification in its objective function. The other two algorithms are hybrid methods that combine the base method with over-sampling and under-sampling techniques. Experiments are conducted using the three proposed methods to rebalance datasets, with other classic resampling methods used for comparison. According to the results, our proposed methods show satisfactory performance relative to the comparison methods, and we extracted meaningful decision rules from a rebalanced dataset in process mining.
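
To make the optimizer named in the abstract concrete, the following is a minimal sketch of a global-best particle swarm optimization loop. It is not the authors' algorithm: the function names `pso_maximize` and `toy_fitness`, the inertia and acceleration constants, and the two-dimensional search space are all assumptions chosen for illustration. In the setting the abstract describes, the fitness would presumably be obtained by training a classifier on a candidate rebalanced dataset and scoring its accuracy and confidence; here a simple quadratic stands in for that expensive evaluation.

```python
import random


def pso_maximize(fitness, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO that maximizes `fitness` inside box `bounds`.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the box, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own
                # best position, and pull toward the swarm's best position.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Clamp the new position to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


def toy_fitness(x):
    # Hypothetical stand-in for a classifier-based score; peak at (0.3, 0.6).
    return -((x[0] - 0.3) ** 2) - ((x[1] - 0.6) ** 2)


if __name__ == "__main__":
    best, fit = pso_maximize(toy_fitness, [(0.0, 1.0), (0.0, 1.0)])
    print(best, fit)
```

The two coordinates could be read as, for example, an over-sampling rate and an under-sampling rate, but that interpretation is an assumption made here and is not stated in the abstract.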

Acknowledgement

The authors are grateful for the financial support of the research grant "Nature-Inspired Computing and Metaheuristics Algorithms for Optimizing Data Mining Performance" (Grant no. MYRG2016-00069-FST), offered by the University of Macau, FST, and RDAO. The authors also appreciate the contribution that Dr. Jingyan Li made to this paper during his PhD studies at the University of Macau; he is now a data scientist at Huawei Technologies Co., Ltd., China.

Author information

Authors and Affiliations

Department of Computer and Information Science, University of Macau, Taipa, Macau SAR, China
Jinyan Li, Yaoyang Wu & Simon Fong
Zhuhai Institute of Advanced Technology Chinese Academy of Science, Zhuhai, China
Yaoyang Wu
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Raymond K. Wong
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Victor W. Chu
Business School, La Trobe University, Victoria, Australia
Kok-leong Ong
School of Medicine, Western Sydney University, Campbelltown, NSW, 2560, Australia
Kelvin K. L. Wong

Corresponding author

Correspondence to Yaoyang Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 2, 3, 4, 5, 6, 7, 8, 9, 10.

Table 2 Performances of Secom dataset with different rebalancing methods

Table 3 Statistics of the extracted rules from each method's rebalanced datasets

Table 4 Rules extracted from Secom dataset by SaCb-SDMORA-NB + JRIP

Table 5 Results of Kappa with different datasets and algorithms

Table 6 Results of Accuracy with different datasets and algorithms

Table 7 Results of BER with different datasets and algorithms

Table 8 Majority class and Minority class Variations of pre-/post-processed by different algorithms in different dataset

Table 9 Majority class and Minority class Variations of pre-/post-processed by different algorithms in different dataset

Table 10 Results of %Time with different datasets for the swarm intelligence-based algorithms
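
Tables 5 to 7 report Kappa, Accuracy, and BER. The paper's exact formulas are not visible in this preview, so the following is only a minimal sketch that assumes the standard definitions of Cohen's kappa and the balanced error rate for a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical. The example shows why accuracy alone can mislead on imbalanced data: a classifier that always predicts the majority class scores high accuracy but zero kappa.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return (accuracy, kappa, ber) from binary confusion-matrix counts.

    Assumes the standard textbook definitions and that both classes
    are present (tp + fn > 0 and fp + tn > 0).
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n

    # Agreement expected by chance, from the row and column marginals.
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((fp + tn) / n) * ((fn + tn) / n)
    p_chance = p_pos + p_neg

    # Cohen's kappa: observed agreement corrected for chance agreement.
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

    # Balanced error rate: mean of the per-class error rates.
    ber = 0.5 * (fn / (tp + fn) + fp / (fp + tn))
    return accuracy, kappa, ber


if __name__ == "__main__":
    # Hypothetical log: 50 rare events (positive), 950 normal events,
    # and a classifier that labels everything as normal.
    acc, kappa, ber = binary_metrics(tp=0, fn=50, fp=0, tn=950)
    print(acc, kappa, ber)
```

For these hypothetical counts the accuracy is 0.95 while kappa is 0 and BER is 0.5, which is the behavior of a model that never detects the rare class.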

About this article

Cite this article

Li, J., Wu, Y., Fong, S. et al. Dynamic swarm class rebalancing for the process mining of rare events. J Supercomput 77, 7549–7583 (2021). https://doi.org/10.1007/s11227-020-03500-x

Accepted: 29 October 2020
Published: 05 January 2021
Issue Date: July 2021
DOI: https://doi.org/10.1007/s11227-020-03500-x
