research-article

Imbalanced big data classification: a distributed implementation of SMOTE

Authors:

Avnish Kumar Rastogi,

Zamir Ahmad Siddiqui

Workshops ICDCN '18: Proceedings of the Workshop Program of the 19th International Conference on Distributed Computing and Networking

Article No.: 14, Pages 1 - 6

https://doi.org/10.1145/3170521.3170535

Published: 04 January 2018

Abstract

In machine learning, data quality is the most critical component for building good models. Predictive analytics is a branch of AI that predicts future events from historical data, and it is used in diverse fields such as detecting online fraud, oil slicks, intrusion attacks, and credit defaults, and the prognosis of diseased cells. Unfortunately, in most of these cases traditional learning models fail to produce the required results because the data is imbalanced. Here, imbalance means that only a small number of instances belong to the class being predicted, for example fraudulent instances among all online transactions. Prediction in imbalanced classification is further limited by factors such as small disjuncts, which are accentuated when data is partitioned for learning at scale. The Synthetic Minority Over-sampling Technique (SMOTE), introduced by Chawla et al. [1], is a pioneering approach that offsets these limitations by synthetically generating minority class data to produce more balanced datasets. Although a standard implementation of SMOTE exists in Python, it is not available for distributed computing environments that handle large datasets. Bringing SMOTE to a distributed environment under Spark is the key motivation for our research. In this paper we present our algorithm, observations, and results for synthetic generation of minority class data under Spark using Locality Sensitive Hashing (LSH). We successfully demonstrate a distributed Spark version of SMOTE that generates quality artificial samples while preserving the spatial distribution.
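To make the idea in the abstract concrete, the following is a minimal single-machine sketch of SMOTE-style interpolation in which the nearest-neighbour search is confined to LSH buckets. It is not the authors' algorithm, which this page does not describe in detail. The function names (`lsh_key`, `smote_with_lsh`), the parameters (`n_hashes`, `width`), and the choice of a random-projection hash in the spirit of p-stable LSH [27] are all assumptions made for illustration. In a Spark setting, each bucket could plausibly map to one partition, but that mapping is also an assumption.

```python
import math
import random
from collections import defaultdict


def lsh_key(point, projections, offsets, width):
    """Hash a point to a bucket key with random projections (hypothetical helper)."""
    return tuple(
        math.floor((sum(a * x for a, x in zip(proj, point)) + b) / width)
        for proj, b in zip(projections, offsets)
    )


def smote_with_lsh(minority, n_synthetic, k=5, n_hashes=2, width=4.0, seed=0):
    """Generate synthetic minority samples; neighbours are searched per LSH bucket."""
    rng = random.Random(seed)
    dim = len(minority[0])
    # Gaussian projection vectors and uniform offsets define the hash functions.
    projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_hashes)]
    offsets = [rng.uniform(0.0, width) for _ in range(n_hashes)]

    # Group minority points by bucket key (assumed analogue of a data partition).
    buckets = defaultdict(list)
    for p in minority:
        buckets[lsh_key(p, projections, offsets, width)].append(p)

    # Only buckets holding at least two points can supply a neighbour.
    usable = [b for b in buckets.values() if len(b) >= 2]
    if not usable:
        raise ValueError("no bucket has two or more points; increase width")

    synthetic = []
    for _ in range(n_synthetic):
        bucket = rng.choice(usable)
        base = rng.choice(bucket)
        # Exact k-NN inside the bucket stands in for a global k-NN search.
        others = [q for q in bucket if q is not base]
        others.sort(key=lambda q: math.dist(base, q))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new sample lies on the segment between base and neighbour.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point is a convex combination of two minority points from the same bucket, it stays inside the region those points already occupy. The sketch samples buckets uniformly rather than in proportion to their size, which is a simplification.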

References

[1]

Chawla, Nitesh, et al. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321--357.


[2]

Yu H, Hong S, Yang X, Ni J, Dan Y, Qin B. 2013. Recognition of multiple imbalanced cancer types based on DNA microarray data using ensemble classifiers. BioMed Research International, 1--13.

[3]

Elhag S, Fernández A, Bawakid A, Alshomrani S, Herrera F. 2015. On the combination of genetic fuzzy systems and pairwise learning for improving detection rates on intrusion detection systems. Expert Syst Appl 42(1), 193--202.


[4]

He H, García E A. 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9), 1263--1284.


[5]

Sun Y, Wong A, Kamel M. 2009. Classification of Imbalanced Data: A Review. International Journal of Pattern Recognition and Artificial Intelligence 23(4), 687--719.

[6]

Fernández, A., Chawla, Nitesh, García, S., Palade, V., Herrera, F. 2017. An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Complex and Intelligent Systems 250(20), 113--141.

[7]

Batista GEAPA, Prati RC, Monard, MC. 2004. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6, 1, 20--29.


[8]

Ramentol E, Vluymans S, Verbiest N, Caballero Y, Bello R, Cornelis C, Herrera F. 2015. IFROWANN: imbalanced fuzzy-rough ordered weighted average nearest neighbor classification. IEEE Trans Fuzzy Systems 23(5), 1622--1637.

[9]

Domingos P. 1999. MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD'99), 155--164.


[10]

D. Laney. 2001. 3D data management: Controlling data volume, velocity, and variety. Tech. rep., META Group.

[11]

Apache Spark https://spark.apache.org/docs/latest/index.html.

[12]

Prati, R.C., G.E. Batista, and M.C. Monard. 2004. Learning with class skews and small disjuncts, Advances in Artificial Intelligence-SBIA, Springer, 296--306.

[13]

Jo, T. and N. Japkowicz. 2004. Class imbalances versus small disjuncts. SIGKDD Explorations Newsletter 6(1), 40--49.


[14]

Alejo, R., et al. 2013. A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios. Pattern Recognition Letters. 34(4), 380--388.


[15]

García, S. and F. Herrera. 2009. Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy. Evolutionary Computation 17(3), 275--306.


[16]

Kamal S, Ripon SH, Dey N, Ashour AS, Santhi V. 2016. A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset. Comput Methods Programs Biomed 131, 191--206.


[17]

Hu F, Li H, Lou H, Dai J. 2014. A parallel oversampling algorithm based on NRSBoundary-SMOTE. Journal of Information & Computer Science 11(13), 4655--4665.

[18]

Antonin Guttman. 1984. R-trees: A dynamic index structure for spatial searching. SIGMOD Conference, 47--57.


[19]

Jon Louis Bentley. 1990. K-d trees for semidynamic point sets. Symposium on Computational Geometry.


[20]

W. Lu, Y. Shen, S. Chen, and B. C. Ooi. 2012. Efficient processing of k nearest neighbor joins using MapReduce. Proceedings of the VLDB Endowment 5(10), 1016--1027.


[21]

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. 2012. Scalable K-Means++. Proceedings of the VLDB Endowment 5(7), 622--633.


[22]

Indyk, P., Motwani, R. 1998. Approximate Nearest Neighbours: Towards Removing the Curse of Dimensionality. Proceedings of the thirtieth annual ACM symposium on Theory of computing, 604--613.


[23]

Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman. 2014. Mining of Massive Datasets. Cambridge University Press.


[24]

Slaney, M., Casey, M. 2008. Locality-Sensitive Hashing for Finding Nearest Neighbors. IEEE Signal Processing Magazine, 129--131.

[25]

Sundaram, N., Turmukhametova, A., Satish, N., Mostak, T., Indyk, P., Madden, S., and Dubey, P. 2013. Streaming Similarity Search over one Billion Tweets using Parallel Locality Sensitive Hashing. Proceedings of the VLDB Endowment 6(14), 1930--1941.


[26]

Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K. 2007. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. Proceedings of the 33rd VLDB, 950--961.


[27]

M. Datar, N. Immorlica, P. Indyk, V. S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th Symposium on Computational Geometry (SCG) 253--262.


[28]

Huang J, Ling CX. 2005. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3), 299--310.


[29]

ECBDL'14 dataset. http://cruncher.ncl.ac.uk/bdcomp/

[30]

Scikit Learn. http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html.

[31]

J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera. 2011. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. Journal of Multiple-Valued Logic and Soft Computing 17:2-3, 255--287.

[32]

Abalone dataset http://sci2s.ugr.es/keel/dataset.php?cod=115.

[33]

Yeast dataset http://sci2s.ugr.es/keel/dataset.php?cod=133.

[34]

H2o https://www.h2o.ai.

[35]

Krawczyk, Bartosz. 2016. Learning from imbalanced data: Open challenges and future directions. Progress in Artificial Intelligence 5(4), 221--232.

[36]

D.S. Huang, X.-P. Zhang, G.-B. Huang. 2005. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Advances in Intelligent Computing, Part I, LNCS 3644, 878--887.



Index Terms

Imbalanced big data classification: a distributed implementation of SMOTE
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. MapReduce algorithms


Information & Contributors

Information

Published In


Workshops ICDCN '18: Proceedings of the Workshop Program of the 19th International Conference on Distributed Computing and Networking

January 2018

151 pages

ISBN:9781450363976

DOI:10.1145/3170521

Conference Chair:
Doina Bein
California State University

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States



Conference

Workshops ICDCN 2018

Workshops ICDCN 2018: Workshops co-located with the International Conference on Distributed Computing and Networks 2018

January 4 - 7, 2018

Varanasi, India


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 10
Total Downloads: 344
Downloads (Last 12 months): 29
Downloads (Last 6 weeks): 3

Reflects downloads up to 07 Mar 2025


Citations

Cited By

Lee S, Park I. (2024). Application of Oversampling Techniques for Enhanced Transverse Dispersion Coefficient Estimation Performance Using Machine Learning Regression. Water 16:10 (1359). Online publication date: 10-May-2024.
https://doi.org/10.3390/w16101359

Asgari M, Yang W, Farnaghi M. (2022). Spatiotemporal data partitioning for distributed random forest algorithm: Air quality prediction using imbalanced big spatiotemporal data on spark distributed framework. Environmental Technology & Innovation 27 (102776). Online publication date: Aug-2022.
https://doi.org/10.1016/j.eti.2022.102776

Srivani B, Sandhya N, Rani B. (2022). A case study for performance analysis of big data stream classification using spark architecture. International Journal of System Assurance Engineering and Management 15:1 (253-266). Online publication date: 2-Jul-2022.
https://doi.org/10.1007/s13198-022-01703-4

Selukar M, Jain P, Kumar T. (2021). A device for effective weed removal for smart agriculture using convolutional neural network. International Journal of System Assurance Engineering and Management 13:S1 (397-404). Online publication date: 9-Nov-2021.
https://doi.org/10.1007/s13198-021-01441-z

Bauder R, Khoshgoftaar T. (2020). A study on rare fraud predictions with big Medicare claims fraud data. Intelligent Data Analysis 24:1 (141-161). Online publication date: 18-Feb-2020.
https://doi.org/10.3233/IDA-184415

Rodriguez-Torres F, Carrasco-Ochoa J, Martínez-Trinidad J. (2019). Deterministic oversampling methods based on SMOTE. Journal of Intelligent & Fuzzy Systems 36:5 (4945-4955). Online publication date: 14-May-2019.
https://doi.org/10.3233/JIFS-179041

Hasanin T, Khoshgoftaar T, Leevy J, Seliya N. (2019). Examining characteristics of predictive models with imbalanced big data. Journal of Big Data 6:1. Online publication date: 31-Jul-2019.
https://doi.org/10.1186/s40537-019-0231-2

Herland M, Bauder R, Khoshgoftaar T. (2019). The effects of class rarity on the evaluation of supervised healthcare fraud detection models. Journal of Big Data 6:1. Online publication date: 28-Feb-2019.
https://doi.org/10.1186/s40537-019-0181-8

Hasanin T, Khoshgoftaar T, Leevy J, Seliya N. (2019). Investigating Random Undersampling and Feature Selection on Bioinformatics Big Data. 2019 IEEE Fifth International Conference on Big Data Computing Service and Applications (BigDataService) (346-356). Online publication date: Apr-2019.
https://doi.org/10.1109/BigDataService.2019.00063

Triguero I, García‐Gil D, Maillo J, Luengo J, García S, Herrera F. (2018). Transforming big data into smart data: An insight on the use of the k‐nearest neighbors algorithm to obtain quality data. WIREs Data Mining and Knowledge Discovery 9:2. Online publication date: 28-Nov-2018.
https://doi.org/10.1002/widm.1289
