DOI: 10.1145/3511808.3557575
Short paper

Data Oversampling with Structure Preserving Variational Learning

Published: 17 October 2022

Abstract

Traditional oversampling methods are well explored for binary and multi-class imbalanced datasets. In most cases, oversampling of the imbalanced classes is performed directly in the data space. This leads to issues such as poor modelling of the structure of the data, resulting in overlap between minority and majority classes and, consequently, poor classification performance on the minority class(es). To overcome these limitations, we propose a novel data oversampling architecture called Structure Preserving Variational Learning (SPVL). This technique captures an uncorrelated distribution among classes in the latent space using an encoder-decoder framework. Minority samples are therefore generated in the latent space, preserving the structure of the data distribution. The improved latent space distribution (oversampled training data) is evaluated by training an MLP classifier and testing on an unseen test set. The proposed SPVL method is applied to various benchmark datasets with i) binary and multi-class imbalanced data, ii) high-dimensional data, and iii) large- or small-scale data. Extensive experimental results demonstrate that the proposed SPVL technique outperforms state-of-the-art counterparts.
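The pipeline described above (encode the data, oversample the minority class in the latent space, decode, then train a classifier) can be sketched as follows. This is an illustrative toy version, not the authors' SPVL implementation: it substitutes PCA for the paper's learned encoder-decoder and fits a single Gaussian to the minority class's latent codes, so that synthetic samples preserve that class's covariance structure rather than being interpolated in the raw data space.

```python
import numpy as np

def latent_oversample(X_min, n_new, n_components=2, rng=None):
    """Toy latent-space oversampling (PCA stands in for a learned
    encoder-decoder; this is NOT the paper's SPVL model).

    1. Project minority samples into a low-dimensional latent space.
    2. Fit a Gaussian there, preserving the class's covariance structure.
    3. Sample new latent points and decode them back to data space.
    """
    rng = np.random.default_rng(rng)
    mu = X_min.mean(axis=0)
    Xc = X_min - mu
    # "Encoder": top principal directions of the minority class.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T          # (d, k) projection matrix
    Z = Xc @ V                       # latent codes of minority samples
    # Gaussian fit in latent space keeps the covariance structure.
    z_mu = Z.mean(axis=0)
    z_cov = np.cov(Z, rowvar=False)
    Z_new = rng.multivariate_normal(z_mu, z_cov, size=n_new)
    # "Decoder": map sampled latent points back to the data space.
    return Z_new @ V.T + mu
```

The synthetic samples returned here would be appended to the minority class before training a downstream classifier (an MLP in the paper's evaluation).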

Supplementary Material

MP4 File (CIKM22-sp0347.mp4)
The class imbalance problem is a chronic headache for the machine learning community. In real-world datasets (e.g., credit fraud, cancer detection) the minority class is crucial, and class imbalance directly affects a machine learning model's performance on the minority class(es). There are classical oversampling techniques targeted at tabular datasets and deep learning oversampling approaches for image datasets, but no unified approach performs well on both. This video presentation introduces a novel technique titled "Data Oversampling with Structure Preserving Variational Learning". The technique retains the covariance structure of the data, performs oversampling in the low-dimensional latent space, and is suited to both tabular and image data. The presentation also covers comparisons with six baseline techniques on six highly relevant datasets.



Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022, 5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. class imbalance
    2. classification
    3. latent space
    4. oversampling
    5. structure preserving

    Qualifiers

    • Short-paper

    Funding Sources

    • Mphasis Cognitive Computing Centre of Excellence at IIIT Bangalore
    • Accelerated Materials Development for Manufacturing Program at A*STAR via the AME Programmatic Fund

    Conference

    CIKM '22

    Acceptance Rates

CIKM '22 paper acceptance rate: 621 of 2,257 submissions (28%)
Overall acceptance rate: 1,861 of 8,427 submissions (22%)

