Second Order Training and Sizing for the Multilayer Perceptron

Abstract

An algorithm is developed for automated training of a multilayer perceptron with two nonlinear layers. The initial algorithm approximately minimizes validation error with respect to the numbers of both hidden units and training epochs. A median filtering approach is added to reduce deviations between validation and testing errors. Next, the mean-squared error objective function is modified for use with classifiers using a method similar to Ho–Kashyap. Then, both theoretical and practical reasons are provided for introducing growing steps into the algorithm. Lastly, a sigmoidal input layer is added to limit the effects of input outliers and further improve the method. Using widely available datasets, the final network’s average testing error is shown to be less than that of several other competing algorithms reported in the literature.
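
As a rough illustration of the median-filtering step mentioned in the abstract, the sketch below smooths a noisy validation-error curve before choosing the stopping epoch. The synthetic curve and the window length are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: median-filter a validation-error sequence so that a single
# noisy dip does not dictate the stopping epoch. Window length is an assumption.
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(0)
epochs = np.arange(1, 201)
# synthetic validation-error curve: decaying trend plus noise (illustrative)
val_error = 0.5 * np.exp(-epochs / 40.0) + 0.1 + 0.02 * rng.standard_normal(epochs.size)

smoothed = medfilt(val_error, kernel_size=9)  # kernel_size must be odd
best_epoch = int(epochs[np.argmin(smoothed)])
print("stop training at epoch", best_epoch)
```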

References

  1. Alchemy-API, IBM Watson (2016). https://www.ibm.com/watson/alchemy-api.html

  2. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473

  3. Bailey RR, Pettit EJ, Borochoff RT, Manry MT, Jiang X (1993) Automatic recognition of USGS land use/cover categories using statistical and neural network classifiers. In: Optical engineering and photonics in aerospace sensing, pp 185–195. International Society for Optics and Photonics

  4. Bartlett MS, Littlewort G, Frank M, Lainscsek C, Fasel I, Movellan J (2005) Recognizing facial expression: machine learning and application to spontaneous behavior. In: IEEE computer society conference on computer vision and pattern recognition, 2005. CVPR 2005, vol 2, pp 568–573. IEEE

  5. Beliakov G, Kelarev A, Yearwood J (2011) Robust artificial neural networks and outlier detection. Technical report. arXiv preprint arXiv:1110.0169

  6. Bishop CM (2006) Pattern recognition and machine learning (information science and statistics). Springer, Berlin

  7. Blackard JA, Dean DJ (1999) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Comput Electron Agric 24(3):131–151

  8. Bose I, Mahapatra RK (2001) Business data mining—a machine learning perspective. Inf Manag 39(3):211–225

  9. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv CSUR 41(3):15

  10. Charalambous C (1992) Conjugate gradient algorithm for efficient training of artificial neural networks. IEE Proc G Circuits Dev Syst 139(3):301–310

  11. Chen M-S, Manry MT (1991) Basis vector analyses of back-propagation neural networks. In: Proceedings of the 34th Midwest symposium on circuits and systems, 1991, pp 23–26. IEEE

  12. Chen S, Cowan CFN, Grant PM (1991) Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans Neural Netw 2(2):302–309

  13. Chollet F et al (2015) Keras. https://github.com/keras-team/keras

  14. Choudhry R, Garg K (2008) A hybrid machine learning system for stock market forecasting. World Acad Sci Eng Technol 39(3):315–318

  15. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  16. Cover TM (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans Electron Comput 3:326–334

  17. Delashmit WH, Manry MT (2007) A neural network growing algorithm that ensures monotonically non-increasing error. Adv Neural Netw 14:280–284

  18. Deng L, Hinton G, Kingsbury B (2013) New types of deep neural network learning for speech recognition and related applications: an overview. In: 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 8599–8603. IEEE

  19. Duda RO, Hart PE, Stork DG (2012) Pattern classification. Wiley, Hoboken

  20. Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J (2008) Liblinear: a library for large linear classification. J Mach Learn Res 9(Aug):1871–1874

  21. Finlayson BA (2013) The method of weighted residuals and variational principles, vol 73. SIAM, Philadelphia

  22. Fletcher R (2013) Practical methods of optimization. Wiley, Hoboken

  23. Fukunaga K (2013) Introduction to statistical pattern recognition. Academic Press, Cambridge

  24. Gallagher N, Wise G (1981) A theoretical analysis of the properties of median filters. IEEE Trans Acoust Speech Signal Process 29(6):1136–1141

  25. Gan G (2013) Application of data clustering and machine learning in variable annuity valuation. Insurance Math Econ 53(3):795–801

  26. Golub GH, Van Loan CF (2012) Matrix computations, vol 3. JHU Press, Baltimore

  27. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge

  28. Goodfellow IJ, Koenig N, Muja M, Pantofaru C, Sorokin A, Takayama L (2010) Help me help you: interfaces for personal robots. In: Proceedings of the 5th ACM/IEEE international conference on human–robot interaction, pp 187–188. IEEE Press

  29. Gore RG, Li J, Manry MT, Liu L-M, Yu C, Wei J (2005) Iterative design of neural network classifiers through regression. Int J Artif Intell Tools 14(1–2):281–301

  30. Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6645–6649. IEEE

  31. Hagiwara M (1990) Novel backpropagation algorithm for reduction of hidden units and acceleration of convergence using artificial selection. In: 1990 IJCNN international joint conference on neural networks, pp 625–630. IEEE

  32. Hassan N, Li C, Tremayne M (2015) Detecting check-worthy factual claims in presidential debates. In: Proceedings of the 24th ACM international on conference on information and knowledge management, pp 1835–1838. ACM

  33. Hassibi B, Stork DG, Wolff GJ (1993) Optimal brain surgeon and general network pruning. In: IEEE international conference on neural networks, 1993, pp 293–299. IEEE

  34. Haykin S (2009) Neural networks and learning machines, vol 3. Pearson, Upper Saddle River, NJ

  35. Hestenes MR, Stiefel E (1952) Methods of conjugate gradients for solving linear systems, vol 49. NBS, Washington

  36. Hinton G, Deng L, Yu D, Dahl GE, Mohamed A, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97

  37. Ho Y-C, Kashyap RL (1965) An algorithm for linear inequalities and its applications. IEEE Trans Electron Comput 5:683–688

  38. Ho Y, Kashyap RL (1966) A class of iterative procedures for linear inequalities. SIAM J Control 4(1):112–115

  39. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366

  40. Huang W, Nakamori Y, Wang S-Y (2005) Forecasting stock market movement direction with support vector machine. Comput Oper Res 32(10):2513–2522

  41. Jacobs RA (1988) Increased rates of convergence through learning rate adaptation. Neural Netw 1(4):295–307

  42. Jiang X, Chen M-S, Manry MT, Dawson MS, Fung AK (1994) Analysis and optimization of neural networks for remote sensing. Remote Sens Rev 9(1–2):97–114

  43. Joshi B, Stewart K, Shapiro D (2017) Bringing impressionism to life with neural style transfer in come swim. arXiv preprint arXiv:1701.04928

  44. Kainen PC, Kurková V, Kreinovich V, Sirisaengtaksin O (1994) Uniqueness of network parametrization and faster learning. Neural Parallel Sci Comput 2(4):459–466

  45. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128–3137

  46. Ke Q, Kanade T (2005) Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming. In: IEEE computer society conference on computer vision and pattern recognition, 2005. CVPR 2005, vol 1, pp 739–746. IEEE

  47. Kendall MG, Stuart A (1968) The advanced theory of statistics: design and analysis, and time-series, vol 3. C. Griffin, Glasgow

  48. Kovalishyn VV, Tetko IV, Luik AI, Kholodovych VV, Villa AEP, Livingstone DJ (1998) Neural network studies. 3. Variable selection in the cascade-correlation learning architecture. J Chem Inf Comput Sci 38(4):651–659

  49. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto

  50. Lawrence S, Giles CL, Tsoi AC, Back AD (1997) Face recognition: a convolutional neural-network approach. IEEE Trans Neural Netw 8(1):98–113

  51. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

  52. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

  53. LeCun Y, Denker JS, Solla SA (1990) Optimal brain damage. In: Advances in neural information processing systems, pp 598–605

  54. LeCun YA, Bottou L, Orr GB, Müller K-R (2012) Efficient backprop. In: Orr GB, Müller KR (eds) Neural networks: tricks of the trade. Springer, Berlin, pp 9–48

  55. Lee H, Battle A, Raina R, Ng AY (2006) Efficient sparse coding algorithms. In: Advances in neural information processing systems, pp 801–808

  56. Li J, Manry MT, Liu L-M, Yu C, Wei J (2004) Iterative improvement of neural classifiers. In: FLAIRS conference, pp 700–705

  57. Liano K (1996) Robust error measure for supervised neural network learning with outliers. IEEE Trans Neural Netw 7(1):246–250

  58. Liu LM, Manry MT, Amar F, Dawson MS, Fung AK (1994) Image classification in remote sensing using functional link neural networks. In: Proceedings of the IEEE southwest symposium on image analysis and interpretation, pp 54–58. IEEE

  59. Malalur SS, Manry MT (2010) Multiple optimal learning factors for feed-forward networks. In: SPIE defense, security and sensing (DSS) conference, Orlando, FL

  60. Malalur SS, Manry MT, Jesudhas P (2015) Multiple optimal learning factors for the multi-layer perceptron. Neurocomputing 149:1490–1501

  61. Maldonado FJ, Manry MT (2002) Optimal pruning of feedforward neural networks based upon the schmidt procedure. In: Conference record of the thirty-sixth Asilomar conference on signals, systems and computers, 2002, vol 2, pp 1024–1028. IEEE

  62. Manry M (2016) EE 5352 Statistical signal processing lecture notes. University lecture, Department of Electrical Engineering, The University of Texas at Arlington

  63. Manry M (2016) EE 5353 Neural networks lecture notes. University lecture, Department of Electrical Engineering, The University of Texas at Arlington

  64. Manry MT, Dawson MS, Fung AK, Apollo SJ, Allen LS, Lyle WD, Gong W (1994) Fast training of neural networks for remote sensing. Remote Sens Rev 9(1–2):77–96

  65. Mitchell TM (1997) Machine learning, 1st edn. McGraw-Hill, Inc., New York

  66. Mnih V, Hinton GE (2010) Learning to detect roads in high-resolution aerial images. In: European conference on computer vision, pp 210–223. Springer

  67. Mozer MC, Smolensky P (1989) Skeletonization: a technique for trimming the fat from a network via relevance assessment. In: Touretzky DS (ed) Advances in neural information processing systems, vol 1. Morgan-Kaufmann, Burlington, pp 107–115

  68. Narasimha PL, Delashmit WH, Manry MT, Li J, Maldonado F (2008) An integrated growing–pruning method for feedforward network training. Neurocomputing 71(13):2831–2847

  69. Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY (2011) Reading digits in natural images with unsupervised feature learning. In: NIPS workshop on deep learning and unsupervised feature learning 2011

  70. Ng A (2011) Sparse autoencoder. CS294A Lecture Notes 72:1–19

  71. Orr GB, Müller K-R (2003) Neural networks: tricks of the trade. Springer, Berlin

  72. Platt J et al (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif 10(3):61–74

  73. Pourreza-Shahri R, Saki F, Kehtarnavaz N, Leboulluec P, Liu H (2013) Classification of ex-vivo breast cancer positive margins measured by hyperspectral imaging. In: 2013 IEEE international conference on image processing, pp 1408–1412. IEEE

  74. Pudil P, Novovičová J, Kittler J (1994) Floating search methods in feature selection. Pattern Recogn Lett 15(11):1119–1125

  75. Rawat R, Patel JK, Manry MT (2013) Minimizing validation error with respect to network size and number of training epochs. In: The 2013 international joint conference on neural networks (IJCNN), pp 1–7. IEEE

  76. Reed R (1993) Pruning algorithms—a survey. IEEE Trans Neural Netw 4(5):740–747

  77. Richard MD, Lippmann RP (1991) Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput 3(4):461–483

  78. Robinson MD, Manry MT (2013) Two-stage second order training in feedforward neural networks. In: FLAIRS conference

  79. Roli F (2004) Statistical and neural classifiers: an integrated approach to design (advances in pattern recognition series) by S. Raudys. Pattern Anal Appl 7(1):114–115

  80. Sartori MA, Antsaklis PJ (1991) A simple method to derive bounds on the size and to train multilayer neural networks. IEEE Trans Neural Netw 2(4):467–471

  81. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117

  82. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems, vol 27. Curran Associates, Inc., Red Hook, pp 3104–3112

  83. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826

  84. Taigman Y, Yang M, Ranzato M, Wolf L (2014) Deepface: closing the gap to human-level performance in face verification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1701–1708

  85. Tetko IV, Kovalishyn VV, Luik AI, Kasheva TN, Villa AEP, Livingstone DJ (2000) Variable selection in the cascade-correlation learning architecture. In: Gundertofte K, Jørgensen FS (eds) Molecular modeling and prediction of bioactivity. Springer, Berlin, pp 472–473

  86. Tyagi K (2012) Second order training algorithms for radial basis function neural networks. Master's thesis

  87. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164

  88. Williamson RC, Helmke U (1995) Existence and uniqueness results for neural network approximations. IEEE Trans Neural Netw 6(1):2–13

  89. Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8(7):1341–1390

  90. Yau H-C, Manry MT (1991) Iterative improvement of a nearest neighbor classifier. Neural Netw 4(4):517–524

  91. Yu C, Manry MT, Li J, Narasimha PL (2006) An efficient hidden layer training method for the multilayer perceptron. Neurocomputing 70(1):525–535

  92. Zhu C, Byrd RH, Lu P, Nocedal J (1997) Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw 23(4):550–560

Author information

Corresponding author

Correspondence to Kanishka Tyagi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Datasets

To evaluate the proposed algorithm and each of its improvements, we use several publicly available datasets. Table 8 lists the specifications of these datasets. Note that all of the datasets used in our experiments have balanced classes.

Table 8 Specification of datasets

1.1.1 Gongtrn Dataset

The raw data consist of images of hand-printed numerals [90] collected by the Internal Revenue Service from 3000 people. We randomly chose 300 characters from each class to generate a 3000-character training set. Images are 32 by 24 binary matrices. An image scaling algorithm is used to remove size variation in the characters, as sketched below. The feature set contains 16 elements, and the 10 classes correspond to the 10 Arabic numerals.
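
The size-normalization step can be pictured with a minimal sketch (not the authors' code): crop each binary character image to its ink bounding box, then rescale it to the fixed 32 by 24 grid. Pillow is assumed for the resampling.

```python
# Illustrative size normalization for binary character images.
import numpy as np
from PIL import Image

def normalize_character(img, out_shape=(32, 24)):
    """img: 2-D 0/1 array; returns a size-normalized 0/1 array."""
    rows, cols = np.any(img, axis=1), np.any(img, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]   # ink bounding box
    c0, c1 = np.where(cols)[0][[0, -1]]
    cropped = (img[r0:r1 + 1, c0:c1 + 1] * 255).astype(np.uint8)
    resized = Image.fromarray(cropped).resize(
        (out_shape[1], out_shape[0]), resample=Image.NEAREST)
    return (np.asarray(resized) > 127).astype(np.uint8)
```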

1.1.2 Comf18 Dataset

The training data file is generated from segmented images [3]. Each segmented region is separately histogram-equalized to 20 levels. Then the joint probability density of pairs of pixels separated by a given distance and a given direction is estimated, using \(0^{\circ }\), \(90^{\circ }\), \(180^{\circ }\), and \(270^{\circ }\) for the directions and 1, 3, and 5 pixels for the separations. The density estimates are computed for each classification window. For each separation, the co-occurrences for the four directions are folded together to form a triangular matrix. From each of the resulting three matrices, six features are computed: angular second moment, contrast, entropy, correlation, and the sums of the main diagonal and the first off-diagonal. This yields 18 features for each classification window; a rough sketch of the computation follows.
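
The sketch below shows one plausible reading of this computation for a single offset; the exact folding and normalization conventions of [3] may differ.

```python
# Hedged sketch: co-occurrence matrix and the six per-matrix statistics for a
# histogram-equalized window with 20 gray levels.
import numpy as np

def cooccurrence(window, dr, dc, levels=20):
    """Estimate the joint density of pixel pairs at offset (dr, dc), folded
    with the opposite direction so the matrix is symmetric."""
    P = np.zeros((levels, levels))
    rows, cols = window.shape
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            P[window[r, c], window[r + dr, c + dc]] += 1.0
    P = P + P.T                      # fold in the opposite direction
    return P / P.sum()

def six_features(P):
    eps = 1e-12
    i, j = np.indices(P.shape)
    mu_i, mu_j = (i * P).sum(), (j * P).sum()
    sd_i = np.sqrt((((i - mu_i) ** 2) * P).sum())
    sd_j = np.sqrt((((j - mu_j) ** 2) * P).sum())
    asm = (P ** 2).sum()                               # angular second moment
    contrast = (((i - j) ** 2) * P).sum()
    entropy = -(P * np.log(P + eps)).sum()
    corr = ((i - mu_i) * (j - mu_j) * P).sum() / (sd_i * sd_j + eps)
    diag = np.trace(P)                                 # main diagonal sum
    off = np.trace(P, offset=1) + np.trace(P, offset=-1)  # first off-diagonals
    return [asm, contrast, entropy, corr, diag, off]
```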

1.1.3 MNIST Dataset

The digit data used in this paper are taken from the MNIST dataset [52], which was itself constructed by modifying a subset of the much larger dataset produced by NIST (the National Institute of Standards and Technology). It comprises a training set of 60,000 examples and a test set of 10,000 examples. The original NIST data had binary (black or white) pixels. To create MNIST, these images were size-normalized to fit in a \(20 \times 20\) pixel box while preserving their aspect ratio; as a consequence of the anti-aliasing used to change the resolution, the resulting MNIST digits are greyscale. These images were then centered in a \(28 \times 28\) box. This dataset is a classic within the machine learning community and has been studied extensively.
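
Because the distributed MNIST files already have this preprocessing applied, experiments typically just load and flatten the images. A minimal sketch using the Keras loader [13], assuming a TensorFlow-backed Keras install:

```python
# Load MNIST and flatten each 28 x 28 image into a 784-element input vector.
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype(np.float32) / 255.0
x_test = x_test.reshape(10000, 784).astype(np.float32) / 255.0
print(x_train.shape, x_test.shape)  # (60000, 784) (10000, 784)
```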

1.1.4 Google Street View Dataset

The Google Street View House Numbers (SVHN) dataset [69] is a real-world image dataset for developing machine learning and object recognition algorithms with minimal data preprocessing and formatting requirements. It is similar in flavor to MNIST (the images are small cropped digits), but it incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real-world problem: recognizing digits and numbers in natural scene images. SVHN is obtained from house numbers in Google Street View images.
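
The cropped-digit version of SVHN is distributed as MATLAB files; a minimal loading sketch, assuming train_32x32.mat from the dataset page has been downloaded locally. Note that the label 10 denotes the digit 0.

```python
# Load SVHN cropped digits from the distributed MATLAB file.
import numpy as np
from scipy.io import loadmat

data = loadmat("train_32x32.mat")                # 'X': (32, 32, 3, N), 'y': (N, 1)
images = np.transpose(data["X"], (3, 0, 1, 2))   # -> (N, 32, 32, 3)
labels = data["y"].ravel() % 10                  # remap label 10 to digit 0
print(images.shape, labels[:10])
```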

1.1.5 CIFAR Dataset

The CIFAR-10 dataset [49] consists of 60,000 \(32\times 32\) colour images in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images. The dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1000 randomly selected images from each class. The training batches contain the remaining images in random order, so an individual training batch may contain more images from one class than another; between them, however, the training batches contain exactly 5000 images from each class.
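
The batches are distributed as Python pickles, as documented on the dataset page; a minimal unpacking sketch (local paths assumed).

```python
# Unpack one CIFAR-10 batch into (N, 32, 32, 3) images and a label array.
import pickle
import numpy as np

def load_batch(path):
    with open(path, "rb") as f:
        d = pickle.load(f, encoding="bytes")
    # each row stores 1024 red, then 1024 green, then 1024 blue values
    images = d[b"data"].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, np.asarray(d[b"labels"])

x1, y1 = load_batch("cifar-10-batches-py/data_batch_1")
print(x1.shape, y1.shape)  # (10000, 32, 32, 3) (10000,)
```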

1.1.6 COVER

This dataset [7] contains the forest cover type for \(30 \times 30\) m observation cells, as determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and the USFS. The data are in raw form (not scaled) and contain binary (0 or 1) columns for the qualitative independent variables (wilderness areas and soil types).
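
A loading sketch follows; scikit-learn's fetcher is assumed here as a convenient mirror of the original distribution, and since the data are unscaled, the ten continuous columns are standardized before any training.

```python
# Fetch the forest cover-type data and standardize its continuous columns.
from sklearn.datasets import fetch_covtype
from sklearn.preprocessing import StandardScaler

cov = fetch_covtype()
X, y = cov.data, cov.target          # 54 columns; 7 cover-type classes
X[:, :10] = StandardScaler().fit_transform(X[:, :10])  # first 10 are continuous
```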

1.1.7 NEWS-20

The 20 Newsgroups dataset [65] is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It has become a popular dataset for experiments in text applications of machine learning techniques, such as text classification and text clustering.
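
Experiments on this collection usually start from a bag-of-words or TF-IDF encoding; a minimal scikit-learn sketch (the vectorizer settings are illustrative, not from the paper).

```python
# Fetch the 20 Newsgroups training split and build a TF-IDF feature matrix.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

train = fetch_20newsgroups(subset="train")
vec = TfidfVectorizer(max_features=20000, stop_words="english")
X_train = vec.fit_transform(train.data)          # sparse (n_docs, 20000)
print(X_train.shape, len(train.target_names))    # ..., 20 classes
```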

1.1.8 Breast Cancer

The breast cancer dataset [73] consists of 989 hyperspectral measurement patterns; their 462 features are reduced to 42 features using principal component analysis.
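
A sketch of the stated reduction, with a random placeholder standing in for the actual feature matrix of [73]:

```python
# Project the raw features onto the top 42 principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(989, 462)   # placeholder with the dimensions cited above
X_reduced = PCA(n_components=42).fit_transform(X)
print(X_reduced.shape)         # (989, 42)
```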

About this article

Cite this article

Tyagi, K., Nguyen, S., Rawat, R. et al. Second Order Training and Sizing for the Multilayer Perceptron. Neural Process Lett 51, 963–991 (2020). https://doi.org/10.1007/s11063-019-10116-7

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11063-019-10116-7

Keywords

Navigation