An efficient data augmentation algorithm generates samples that improve the accuracy and robustness of trained models. Augmentation with informative samples imparts meaning to the augmented data set. In this paper, we propose CoPASample (Covariance Preserving Algorithm for generating Samples), a data augmentation algorithm that generates samples reflecting the first- and second-order statistical information of the data set, thereby augmenting the data set in a manner that preserves its total covariance. To avoid the exponential number of computations otherwise required to generate points for augmentation, we formulate an optimisation problem, motivated by the approach used in \(\nu \)-SVR, that iteratively computes a heuristics-based optimal set of points for augmentation in polynomial time. Experimental results on several data sets and comparisons with other data augmentation algorithms validate the potential of the proposed algorithm.
R. Agrawal and P. Kothari contributed equally.
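To make the covariance-preserving property in the abstract concrete, the following is a minimal sketch of one generic way to augment a data set without changing its mean or total covariance. It is not the CoPASample algorithm and does not use the \(\nu \)-SVR-motivated optimisation; it is a plain whiten-and-recolour construction, and the function name `moment_matched_samples` and its parameters are hypothetical. It assumes the biased covariance (division by n) and a data covariance matrix of full rank. Under those assumptions, if the new points share the mean and covariance of the original points, so does the pooled data set.

```python
import numpy as np


def moment_matched_samples(X, m, seed=0):
    """Return m synthetic points with the same mean and biased covariance as X.

    Hypothetical helper for illustration only; not the paper's method.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if m <= d:
        raise ValueError("need m > d so the noise covariance has full rank")

    mu = X.mean(axis=0)
    # Biased covariance (divide by n) so that pooling the sets is exact.
    sigma = np.atleast_2d(np.cov(X, rowvar=False, bias=True))

    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((m, d))
    Z -= Z.mean(axis=0)  # exact zero mean

    # Whiten: after this step Z_white.T @ Z_white / m is the identity.
    L_z = np.linalg.cholesky(Z.T @ Z / m)
    Z_white = np.linalg.solve(L_z, Z.T).T

    # Recolour with the Cholesky factor of the data covariance, then shift.
    L_x = np.linalg.cholesky(sigma)
    return mu + Z_white @ L_x.T


# Usage: augment 50 correlated 3-D points with 20 synthetic ones.
rng = np.random.default_rng(1)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.7]])
X = rng.standard_normal((50, 3)) @ mix + 5.0
Y = moment_matched_samples(X, m=20)
X_aug = np.vstack([X, Y])

mean_gap = np.abs(X_aug.mean(axis=0) - X.mean(axis=0)).max()
cov_gap = np.abs(
    np.cov(X_aug, rowvar=False, bias=True) - np.cov(X, rowvar=False, bias=True)
).max()
print(mean_gap, cov_gap)  # both should be at floating-point noise level
```

The sketch matches the moments exactly but places points without regard to class structure or informativeness, which is the part the paper's heuristic optimisation is meant to address.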
The authors would like to acknowledge Dr. Sriparna Bandopadhyay (Indian Institute of Technology Guwahati) and Dr. Ayon Ganguly (Indian Institute of Technology Guwahati) for their valuable feedback.
© 2019 Springer Nature Switzerland AG
Cite this paper
Agrawal, R., Kothari, P. (2019). CoPASample: A Heuristics Based Covariance Preserving Data Augmentation. In: Nicosia, G., Pardalos, P., Umeton, R., Giuffrida, G., Sciacca, V. (eds) Machine Learning, Optimization, and Data Science. LOD 2019. Lecture Notes in Computer Science, vol 11943. Springer, Cham. https://doi.org/10.1007/978-3-030-37599-7_26
DOI: https://doi.org/10.1007/978-3-030-37599-7_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37598-0
Online ISBN: 978-3-030-37599-7
eBook Packages: Computer Science, Computer Science (R0)