
CoPASample: A Heuristics Based Covariance Preserving Data Augmentation

  • Conference paper
  • First Online:
Machine Learning, Optimization, and Data Science (LOD 2019)

Abstract

An efficient data augmentation algorithm generates samples that improve the accuracy and robustness of trained models. Augmenting with informative samples imparts meaning to the augmented data set. In this paper, we propose CoPASample (Covariance Preserving Algorithm for generating Samples), a data augmentation algorithm that generates samples reflecting the first- and second-order statistics of the data set, thereby augmenting the data in a manner that preserves its total covariance. To avoid the exponential cost of generating candidate points for augmentation, we formulate an optimisation problem, motivated by the approach used in ν-SVR, that iteratively computes a heuristics-based optimal set of augmentation points in polynomial time. Experimental results on several data sets and comparisons with other data augmentation algorithms validate the potential of the proposed algorithm.
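To make the covariance-preserving idea concrete, the sketch below shows one simple way an augmentation can keep the first- and second-order statistics of a data set exactly: reflecting every sample about the mean. This is a minimal illustration of the property the abstract describes, not the CoPASample algorithm itself; the function name `reflect_augment` and the use of NumPy are assumptions for this sketch.

```python
import numpy as np

def reflect_augment(X):
    """Augment X with the reflection of each point about the sample mean.

    Each synthetic point x' = 2*mu - x lies at the same deviation from the
    mean as x (with opposite sign), so the augmented set has exactly the
    original mean and the original total (population) covariance.
    """
    mu = X.mean(axis=0)
    X_synth = 2.0 * mu - X  # reflect every sample about the mean
    return np.vstack([X, X_synth])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X_aug = reflect_augment(X)

# Mean and population covariance are preserved by construction.
print(np.allclose(X.mean(axis=0), X_aug.mean(axis=0)))                   # True
print(np.allclose(np.cov(X.T, bias=True), np.cov(X_aug.T, bias=True)))   # True
```

Reflection preserves covariance because each synthetic deviation is the negation of an original one, so the outer products summed in the covariance are unchanged; CoPASample instead selects its augmentation points via a ν-SVR-style optimisation, which this toy example does not attempt.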

R. Agrawal and P. Kothari contributed equally.




Acknowledgement

The authors would like to thank Dr. Sriparna Bandopadhyay (Indian Institute of Technology Guwahati) and Dr. Ayon Ganguly (Indian Institute of Technology Guwahati) for their valuable feedback.


Correspondence to Rishabh Agrawal.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Agrawal, R., Kothari, P. (2019). CoPASample: A Heuristics Based Covariance Preserving Data Augmentation. In: Nicosia, G., Pardalos, P., Umeton, R., Giuffrida, G., Sciacca, V. (eds) Machine Learning, Optimization, and Data Science. LOD 2019. Lecture Notes in Computer Science, vol. 11943. Springer, Cham. https://doi.org/10.1007/978-3-030-37599-7_26


  • DOI: https://doi.org/10.1007/978-3-030-37599-7_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-37598-0

  • Online ISBN: 978-3-030-37599-7

  • eBook Packages: Computer Science (R0)
