Abstract
In this paper, we introduce the smoothed piecewise linear network (SPLN) and develop second-order training algorithms for it. An embedded feature selection algorithm is developed that minimizes training error with respect to distance measure weights. A method is then presented that adjusts center vector locations in the SPLN, and a gradient method is given for optimizing the SPLN output weights. Results on several data sets show that distance measure optimization, center vector optimization, and output weight optimization, individually and together, reduce testing error in the final network.
References
Aksoy S, Haralick R, Cheikh F, Gabbouj M (2000) A weighted distance approach to relevance feedback. In: International conference on pattern recognition, vol 15, pp 812–815
Bandyopadhyay S, Maulik U (2002) An evolutionary technique based on K-means algorithm for optimal clustering in RN. Inf Sci 146(1–4):221–237
Bellman R (1957) Dynamic programming. Princeton University Press, Princeton
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
Brotherton T, Johnson T (2001) Anomaly detection for advanced military aircraft using neural networks. In: Proceedings of 2001 IEEE aerospace conference
Cai X, Tyagi K, Manry MT (2011) An optimal construction and training of second order RBF network for approximation and illumination invariant image segmentation. In: The 2011 international joint conference on neural networks (IJCNN), pp 3120–3126
Chandrasekaran H, Li J, Delashmit WH, Narasimha PL, Yu C, Manry MT (2007) Convergent design of piecewise linear neural networks. Neurocomputing 70(4):1022–1039. http://www.sciencedirect.com/science/article/pii/S0925231206002372
Chang H, Yeung DY (2008) Robust path-based spectral clustering. Pattern Recognit 41(1):191–203. doi:10.1016/j.patcog.2007.04.010. http://www.sciencedirect.com/science/article/pii/S0031320307002038
Chen G, Teboulle M (1994) A proximal-based decomposition method for convex minimization problems. Math Program 64(1):81–101. doi:10.1007/BF01582566
Chien MJ, Kuh E (1977) Solving nonlinear resistive networks using piecewise-linear analysis and simplicial subdivision. IEEE Trans Circuits Syst 24(6):305–317. doi:10.1109/TCS.1977.1084349
Cormen TH (2009) Introduction to algorithms. MIT Press, Cambridge
Cortez P, Cerdeira A, Almeida F, Matos T, Reis J (2009) Modeling wine preferences by data mining from physicochemical properties. Decis Support Syst 47(4):547–553. doi:10.1016/j.dss.2009.05.016
Craven MW, Shavlik JW (1997) Using neural networks for data mining. FGCS Future Gener Comput Syst 13(2–3):211–229
Dawson MS, Olvera J, Fung AK, Manry MT (1992) Inversion of surface parameters using fast learning neural networks. In: IGARSS’92, pp 910–912
Dawson MS, Fung AK, Manry MT (1993) Surface parameter retrieval using fast learning neural networks. Remote Sens Rev 7(1):1–18
Dettman JW (1988) Mathematical methods in physics and engineering. Dover Publications, New York
Du Q, Faber V, Gunzburger M (1999) Centroidal Voronoi tessellations: applications and algorithms. SIAM Rev 41(4):637–676. doi:10.1137/S0036144599352836
Fan J, Li R (2006) Statistical challenges with high dimensionality: feature selection in knowledge discovery. arXiv:math/0602133
Fan J, Fan Y, Lv J (2008) High dimensional covariance matrix estimation using a factor model. J Econom 147(1):186–197. http://www.sciencedirect.com/science/article/pii/S0304407608001346
Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67. http://www.jstor.org/stable/2241837
Fujisawa T, Kuh ES (1972) Piecewise-linear theory of nonlinear networks. SIAM J Appl Math 22(2):307–328. doi:10.1137/0122030
Guyon I (1991) Applications of neural networks to character recognition. Int J Pattern Recognit Artif Intell 5(1):353–382
Hagan MT, Menhaj MB (1994) Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Netw 5(6):989–993
Hammer B, Villmann T (2002) Generalized relevance learning vector quantization. Neural Netw 15(8):1059–1068. http://www.sciencedirect.com/science/article/pii/S0893608002000795
Haykin S (1994) Neural networks: a comprehensive foundation. Macmillan, New York
Karthikeyan M, Glen RC, Bender A (2005) General melting point prediction based on a diverse compound data set and artificial neural networks. J Chem Inf Model 45(3):581–590. http://pubs.acs.org/doi/abs/10.1021/ci0500132
Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480
Kuhn M (2013) QSARdata: quantitative structure activity relationship (QSAR) data sets. https://CRAN.R-project.org/package=QSARdata. R package version 1.3
Lawrence S, Giles CL, Tsoi AC, Back AD (1997) Face recognition: a convolutional neural-network approach. IEEE Trans Neural Netw 8(1):98–113
Levenberg K (1944) A method for the solution of certain non-linear problems in least squares. Q Appl Math 2(2):164–168
Lewis FL, Jagannathan S, Yeildirek A (1998) Neural network control of robot manipulators and nonlinear systems. CRC, Boca Raton
Li J, Manry MT, Narasimha PL, Yu C (2006) Feature selection using a piecewise linear network. IEEE Trans Neural Netw 17(5):1101–1115
Lichman M (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
Lu H, Setiono R, Liu H (1996) Effective data mining using neural networks. IEEE Trans Knowl Data Eng 8(6):957–961
Luo ZQ, Tseng P (1992) On the convergence of the coordinate descent method for convex differentiable minimization. J Optim Theory Appl 72(1):7–35. doi:10.1007/BF00939948
Maldonado FJ, Manry MT (2002) Optimal pruning of feedforward neural networks based upon the Schmidt procedure. In: Asilomar conference on signals, systems and computers, vol 2, pp 1024–1028
Marquardt DW (1963) An algorithm for least-squares estimation of nonlinear parameters. J Soc Ind Appl Math 11(2):431–441
Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179
Nerrand O, Roussel-Ragot P, Personnaz L, Dreyfus G, Marcos S (1993) Neural networks and nonlinear adaptive filtering: unifying concepts and new algorithms. Neural Comput 5(2):165–199
Nocedal J, Wright S (2006) Numerical optimization. Springer, Berlin
Oh Y, Sarabandi K, Ulaby FT (1992) An empirical model and an inversion technique for radar scattering from bare soil surfaces. IEEE Trans Geosci Remote Sens 30(2):370–381
Ortega JM, Rheinboldt WC (1970) Iterative solution of nonlinear equations in several variables, vol 30. SIAM, Philadelphia
Peña JM, Lozano JA, Larrañaga P (1999) An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recognit Lett 20(10):1027–1040. doi:10.1016/S0167-8655(99)00069-0. http://www.sciencedirect.com/science/article/pii/S0167865599000690
Samworth RJ (2012) Optimal weighted nearest neighbour classifiers. Ann Stat 40(5):2733–2763. doi:10.1214/12-AOS1049. http://projecteuclid.org/euclid.aos/1359987536
Selim SZ, Ismail MA (1984) K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6(1):81–87
Shepherd AJ (2012) Second-order methods for neural networks: fast and reliable training methods for multi-layer perceptrons. Springer, Berlin
Franti P et al (2015) Clustering datasets. http://cs.uef.fi/sipu/datasets/
Subbarayan S, Kim KK, Manry MT, Devarajan V, Chen HH (1996) Modular neural network architecture using piece-wise linear mapping. In: 1996 Conference record of the thirtieth Asilomar conference on signals, systems and computers, 1996, pp 1171–1175
Tikhonov AN, Arsenin VI (1977) Solutions of ill-posed problems. Winston, Washington
Turner R (2016) deldir: Delaunay triangulation and Dirichlet (Voronoi) tessellation. https://CRAN.R-project.org/package=deldir. R package version 0.1-12
Vapnik VN (1999) An overview of statistical learning theory. IEEE Trans Neural Netw 10(5):988–999
Waibel A, Hanazawa T, Hinton G, Shikano K, Lang KJ (1989) Phoneme recognition using time-delay neural networks. IEEE Trans Acoust Speech Signal Process 37(3):328–339. doi:10.1109/29.21701
Wang YJ, Lin CT (1998) A second-order learning algorithm for multilayer networks based on block Hessian matrix. Neural Netw 11(9):1607–1622
White H (1988) Economic prediction using neural networks: the case of IBM daily stock returns. Proc IEEE Int Conf Neural Netw 2:451–458
Wilson CL, Candela GT, Watson CI (1994) Neural network fingerprint classification. J Artif Neural Netw 1(2):203–228
Yeh IC (1998) Modeling of strength of high-performance concrete using artificial neural networks. Cem Concr Res 28(12):1797–1808. http://www.sciencedirect.com/science/article/pii/S0008884698001653
Acknowledgements
We thank the reviewers for their insightful suggestions.
Additional information
The research was partially sponsored by the National Science Foundation Award CMMI-1434401.
Appendix
1.1 Calculations for Distance Measure Optimization
Taking the gradient of the error in Eq. (12) with respect to the distance measure weight change element \(e_{b}(v)\),
where
Here, we represent \(d_{p}\left( k \right) \) with \(d_{k}\) to improve readability.
The elements of the Gauss–Newton Hessian matrix \({\mathbf {H}}_{b}\) are calculated as
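Although the paper's equations are defined by its Eq. (12), the Gauss–Newton construction follows a standard pattern. As a sketch only, assume a mean-squared error \(E = \frac{1}{N_v}\sum_{p}\sum_{i}\bigl(t_p(i)-y_p(i)\bigr)^2\) over \(N_v\) training patterns with desired outputs \(t_p(i)\) and network outputs \(y_p(i)\) (this notation is an assumption, not taken from the paper). Then the gradient and Gauss–Newton Hessian with respect to the distance weight changes have the generic form

```latex
g_b(v) = \frac{\partial E}{\partial e_b(v)}
       = -\frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}
         \bigl(t_p(i)-y_p(i)\bigr)\,
         \frac{\partial y_p(i)}{\partial e_b(v)},
\qquad
H_b(u,v) \approx \frac{2}{N_v}\sum_{p=1}^{N_v}\sum_{i=1}^{M}
         \frac{\partial y_p(i)}{\partial e_b(u)}\,
         \frac{\partial y_p(i)}{\partial e_b(v)},
```

where, as usual in Gauss–Newton, the second-derivative terms of the network outputs are dropped so that \({\mathbf {H}}_{b}\) is built from first derivatives only.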
1.2 Calculations for Center Vector Optimization
The gradient of the SPLN error from Eq. (12) with respect to the uth cluster’s center vector element \({\mathbf {m}}_{u}(v)\) is calculated as:
where
The elements of the Gauss–Newton Hessian matrix \({\mathbf {H}}_{m}\) are given as
and the gradient of the error with respect to the learning factor elements
where
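Whatever the exact entries of \({\mathbf {H}}_{m}\) and the corresponding gradient, each second-order update ultimately reduces to solving a linear system \({\mathbf {H}}\,{\mathbf {d}} = -{\mathbf {g}}\) for the parameter change. A minimal sketch in Python (function and variable names are hypothetical, not the authors' code; a small ridge term is added in Levenberg–Marquardt fashion to guard against a singular Hessian):

```python
import numpy as np

def gauss_newton_step(H, g, ridge=1e-4):
    """Solve (H + ridge*I) d = -g for the update direction d.

    H : Gauss-Newton Hessian, g : error gradient.
    The ridge term (Levenberg-Marquardt style) keeps the system
    well-conditioned when H is near-singular.
    """
    n = H.shape[0]
    return np.linalg.solve(H + ridge * np.eye(n), -g)

# Toy quadratic example: E(w) = ||A w - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

w = np.zeros(3)
for _ in range(20):
    r = A @ w - b        # residuals
    g = 2 * A.T @ r      # gradient of E
    H = 2 * A.T @ A      # Gauss-Newton Hessian (exact for this E)
    w = w + gauss_newton_step(H, g)

# Closed-form least-squares solution for comparison
w_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
```

On this toy least-squares problem the Gauss–Newton Hessian is exact, so the iteration converges to the least-squares solution; in the SPLN setting \({\mathbf {H}}\) and \({\mathbf {g}}\) would come from the appendix expressions, with a learning factor scaling the step.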
1.3 Description of Datasets
(1), (2) Red and white wine quality data sets These two data sets correspond to red and white variants of the Portuguese "Vinho Verde" wine [12]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available; there are, for example, no data on grape type, wine brand, or selling price.
(3) Twod data set This data set is used for inverting surface scattering parameters from an inhomogeneous layer above a homogeneous half-space, where both interfaces are randomly rough. The file contains 2768 patterns, each with eight inputs and seven outputs [14, 15].
(4) Three spirals data This synthetic data set, used in [8], consists of three two-dimensional spirals, each labelled as a different class. It was converted to a regression problem by encoding the classes as binary outputs. It is available at [47].
(5) Oh7 data This data set is taken from [41]. The training set contains VV and HH polarizations at L-band \(30^{\circ }\), \(40^{\circ }\); C-band \(10^{\circ }\), \(30^{\circ }\), \(40^{\circ }\), \(50^{\circ }\), \(60^{\circ }\); and X-band \(30^{\circ }\), \(40^{\circ }\), \(50^{\circ }\), along with the corresponding unknowns: RMS surface height, surface correlation length, and volumetric soil moisture content in g/cm\(^{3}\). The file has 20 inputs, 3 outputs, and 10,453 training patterns.
(6) Melting point data This data set comes from [26], where a robust and general model is developed for predicting melting points. It contains a diverse set of 4401 compounds with 202 descriptors capturing molecular physicochemical and other graph-based properties. It is included in the R package QSARdata [28].
(7) Concrete data This data set is used to predict the compressive strength of high-performance concrete from its components and age. It comprises eight inputs: the first seven are the quantities of cement, slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate in kg/m\(^{3}\); the eighth is age in days. The output is the compressive strength in MPa [56].
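The class-to-regression conversion described for the three spirals data (item 4) can be sketched as a simple one-per-class binary encoding. This is a generic illustration with hypothetical names, not the authors' code:

```python
import numpy as np

def classes_to_binary_outputs(labels, n_classes=None):
    """Encode integer class labels as one-per-class binary output
    vectors, turning a classification set into a regression set."""
    labels = np.asarray(labels)
    if n_classes is None:
        n_classes = int(labels.max()) + 1
    targets = np.zeros((labels.size, n_classes))
    targets[np.arange(labels.size), labels] = 1.0
    return targets

# Three classes -> three binary regression outputs per pattern
t = classes_to_binary_outputs([0, 2, 1, 0])
```

A network trained on such targets approximates the class posterior for each output, so the encoding fits naturally into a regression framework like the SPLN's.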
Cite this article
Rawat, R., Manry, M.T. Second Order Training of a Smoothed Piecewise Linear Network. Neural Process Lett 46, 915–942 (2017). https://doi.org/10.1007/s11063-017-9618-2