
Second Order Training of a Smoothed Piecewise Linear Network

Abstract

In this paper, we introduce a smoothed piecewise linear network (SPLN) and develop second order training algorithms for it. An embedded feature selection algorithm is developed which minimizes training error with respect to distance measure weights. Then a method is presented which adjusts center vector locations in the SPLN. We also present a gradient method for optimizing the SPLN output weights. Results with several data sets show that the distance measure optimization, center vector optimization, and output weight optimization, individually and together, reduce testing errors in the final network.

References

  1. Aksoy S, Haralick R, Cheikh F, Gabbouj M (2000) A weighted distance approach to relevance feedback. In: International conference on pattern recognition, vol 15, pp 812–815

  2. Bandyopadhyay S, Maulik U (2002) An evolutionary technique based on K-means algorithm for optimal clustering in \(R^{N}\). Inf Sci 146(1–4):221–237

  3. Bellman R (1957) Dynamic programming. Princeton University Press, Princeton

  4. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont

  5. Brotherton T, Johnson T (2001) Anomaly detection for advanced military aircraft using neural networks. In: Proceedings of 2001 IEEE aerospace conference

  6. Cai X, Tyagi K, Manry MT (2011) An optimal construction and training of second order RBF network for approximation and illumination invariant image segmentation. In: The 2011 international joint conference on neural networks (IJCNN), pp 3120–3126

  7. Chandrasekaran H, Li J, Delashmit WH, Narasimha PL, Yu C, Manry MT (2007) Convergent design of piecewise linear neural networks. Neurocomputing 70(4):1022–1039. http://www.sciencedirect.com/science/article/pii/S0925231206002372

  8. Chang H, Yeung DY (2008) Robust path-based spectral clustering. Pattern Recognit 41(1):191–203. doi:10.1016/j.patcog.2007.04.010. http://www.sciencedirect.com/science/article/pii/S0031320307002038

  9. Chen G, Teboulle M (1994) A proximal-based decomposition method for convex minimization problems. Math Program 64(1):81–101. doi:10.1007/BF01582566

  10. Chien MJ, Kuh E (1977) Solving nonlinear resistive networks using piecewise-linear analysis and simplicial subdivision. IEEE Trans Circuits Syst 24(6):305–317. doi:10.1109/TCS.1977.1084349

  11. Cormen TH (2009) Introduction to algorithms. MIT Press, Cambridge

  12. Cortez P, Cerdeira A, Almeida F, Matos T, Reis J (2009) Modeling wine preferences by data mining from physicochemical properties. Decis Support Syst 47(4):547–553. doi:10.1016/j.dss.2009.05.016

  13. Craven MW, Shavlik JW (1997) Using neural networks for data mining. FGCS Future Gener Comput Syst 13(2–3):211–229

  14. Dawson MS, Olvera J, Fung AK, Manry MT (1992) Inversion of surface parameters using fast learning neural networks. In: IGARSS’92, pp 910–912

  15. Dawson MS, Fung AK, Manry MT (1993) Surface parameter retrieval using fast learning neural networks. Remote Sens Rev 7(1):1–18

  16. Dettman JW (1988) Mathematical methods in physics and engineering. Dover Publications, New York

  17. Du Q, Faber V, Gunzburger M (1999) Centroidal Voronoi tessellations: applications and algorithms. SIAM Rev 41(4):637–676. doi:10.1137/S0036144599352836

  18. Fan J, Li R (2006) Statistical challenges with high dimensionality: feature selection in knowledge discovery. arXiv:math/0602133

  19. Fan J, Fan Y, Lv J (2008) High dimensional covariance matrix estimation using a factor model. J Econom 147(1):186–197. http://www.sciencedirect.com/science/article/pii/S0304407608001346

  20. Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67. http://www.jstor.org/stable/2241837

  21. Fujisawa T, Kuh ES (1972) Piecewise-linear theory of nonlinear networks. SIAM J Appl Math 22(2):307–328. doi:10.1137/0122030

  22. Guyon I (1991) Applications of neural networks to character recognition. Int J Pattern Recognit Artif Intell 5(1):353–382

  23. Hagan MT, Menhaj MB (1994) Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Netw 5(6):989–993

  24. Hammer B, Villmann T (2002) Generalized relevance learning vector quantization. Neural Netw 15(8):1059–1068. http://www.sciencedirect.com/science/article/pii/S0893608002000795

  25. Haykin S (1994) Neural networks: a comprehensive foundation. Macmillan, New York

  26. Karthikeyan M, Glen RC, Bender A (2005) General melting point prediction based on a diverse compound data set and artificial neural networks. J Chem Inf Model 45(3):581–590. http://pubs.acs.org/doi/abs/10.1021/ci0500132

  27. Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480

  28. Kuhn M (2013) QSARdata: quantitative structure activity relationship (QSAR) data sets. https://CRAN.R-project.org/package=QSARdata. R package version 1.3

  29. Lawrence S, Giles CL, Tsoi AC, Back AD (1997) Face recognition: a convolutional neural-network approach. IEEE Trans Neural Netw 8(1):98–113

  30. Levenberg K (1944) A method for the solution of certain non-linear problems in least squares. Q Appl Math 2(2):164–168

  31. Lewis FL, Jagannathan S, Yeildirek A (1998) Neural network control of robot manipulators and nonlinear systems. CRC, Boca Raton

  32. Li J, Manry MT, Narasimha PL, Yu C (2006) Feature selection using a piecewise linear network. IEEE Trans Neural Netw 17(5):1101–1115

  33. Lichman M (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml

  34. Lu H, Setiono R, Liu H (1996) Effective data mining using neural networks. IEEE Trans Knowl Data Eng 8(6):957–961

  35. Luo ZQ, Tseng P (1992) On the convergence of the coordinate descent method for convex differentiable minimization. J Optim Theory Appl 72(1):7–35. doi:10.1007/BF00939948

  36. Maldonado FJ, Manry MT (2002) Optimal pruning of feedforward neural networks based upon the Schmidt procedure. In: Asilomar conference on signals, systems and computers, vol 2, pp 1024–1028

  37. Marquardt DW (1963) An algorithm for least-squares estimation of nonlinear parameters. J Soc Ind Appl Math 11(2):431–441

  38. Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179

  39. Nerrand O, Roussel-Ragot P, Personnaz L, Dreyfus G, Marcos S (1993) Neural networks and nonlinear adaptive filtering: unifying concepts and new algorithms. Neural Comput 5(2):165–199

  40. Nocedal J, Wright S (2006) Numerical optimization. Springer, Berlin

  41. Oh Y, Sarabandi K, Ulaby FT (1992) An empirical model and an inversion technique for radar scattering from bare soil surfaces. IEEE Trans Geosci Remote Sens 30(2):370–381

  42. Ortega JM, Rheinboldt WC (1970) Iterative solution of nonlinear equations in several variables, vol 30. SIAM, Philadelphia

  43. Peña JM, Lozano JA, Larrañaga P (1999) An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recognit Lett 20(10):1027–1040. doi:10.1016/S0167-8655(99)00069-0. http://www.sciencedirect.com/science/article/pii/S0167865599000690

  44. Samworth RJ (2012) Optimal weighted nearest neighbour classifiers. Ann Stat 40(5):2733–2763. doi:10.1214/12-AOS1049. http://projecteuclid.org/euclid.aos/1359987536

  45. Selim SZ, Ismail MA (1984) K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6(1):81–87

  46. Shepherd AJ (2012) Second-order methods for neural networks: fast and reliable training methods for multi-layer perceptrons. Springer, Berlin

  47. Fränti P et al (2015) Clustering datasets. http://cs.uef.fi/sipu/datasets/

  48. Subbarayan S, Kim KK, Manry MT, Devarajan V, Chen HH (1996) Modular neural network architecture using piece-wise linear mapping. In: 1996 Conference record of the thirtieth Asilomar conference on signals, systems and computers, 1996, pp 1171–1175

  49. Tikhonov AN, Arsenin VI (1977) Solutions of ill-posed problems. Winston, Washington

  50. Turner R (2016) deldir: Delaunay triangulation and Dirichlet (Voronoi) tessellation. CRAN. https://CRAN.R-project.org/package=deldir. R package version 0.1-12

  51. Vapnik VN (1999) An overview of statistical learning theory. IEEE Trans Neural Netw 10(5):988–999

  52. Waibel A, Hanazawa T, Hinton G, Shikano K, Lang KJ (1989) Phoneme recognition using time-delay neural networks. IEEE Trans Acoust Speech Signal Process 37(3):328–339. doi:10.1109/29.21701

  53. Wang YJ, Lin CT (1998) A second-order learning algorithm for multilayer networks based on block Hessian matrix. Neural Netw 11(9):1607–1622

  54. White H (1988) Economic prediction using neural networks: the case of IBM daily stock returns. Proc IEEE Int Conf Neural Netw 2:451–458

  55. Wilson CL, Candela GT, Watson CI (1994) Neural network fingerprint classification. J Artif Neural Netw 1(2):203–228

  56. Yeh IC (1998) Modeling of strength of high-performance concrete using artificial neural networks. Cem Concr Res 28(12):1797–1808. http://www.sciencedirect.com/science/article/pii/S0008884698001653

Acknowledgements

We thank the reviewers for their insightful suggestions.

Author information

Corresponding author

Correspondence to Rohit Rawat.

Additional information

The research was partially sponsored by the National Science Foundation Award CMMI-1434401.

Appendix

1.1 Calculations for Distance Measure Optimization

Taking the gradient of the error in Eq. (12) with respect to the distance measure weight change element \(e_{b}(v)\),

$$\begin{aligned} \frac{\partial E}{\partial e_{b}\left( v \right) } = - \frac{2}{N_{{\mathrm {v}}}}\sum _{p = 1}^{N_{{\mathrm {v}}}}{\sum _{i = 1}^{M}{\left[t_{p}\left( i \right) - y_{p}\left( i \right) \right]\frac{\partial y_{p}\left( i \right) }{\partial e_{b}\left( v \right) }}} \end{aligned}$$
(36)

where

$$\begin{aligned} \frac{\partial y_{p}\left( i \right) }{\partial e_{b}\left( v \right) } = \sum _{k = 1}^{K}{\frac{\partial \theta \left( k \right) }{\partial e_{b}\left( v \right) } y_{\text {pk}}(i)} \end{aligned}$$

Here, we represent \(d_{p}\left( k \right) \) with \(d_{k}\) to improve readability.

$$\begin{aligned}&\frac{\partial \theta \left( k \right) }{\partial e_{b}\left( v \right) } = \frac{D\frac{\partial d_{k}^{- a}}{\partial e_{b}\left( v \right) } - d_{k}^{- a}\sum _{m = 1}^{K}\frac{\partial d_{m}^{- a}}{\partial e_{b}\left( v \right) }}{D^{2}}\\&\frac{\partial d_{k}^{- a}}{\partial e_{b}\left( v \right) } = - \frac{a}{d_{k}^{a + 1}}\frac{\partial d_{k}}{\partial e_{b}\left( v \right) }\\&\frac{\partial d_{k}}{\partial e_{b}\left( v \right) } = \left( x_{p}\left( v \right) - m_{k}\left( v \right) \right) ^{2} \end{aligned}$$

The elements of the Gauss–Newton Hessian matrix \({\mathbf {H}}_{b}\) are calculated as

$$\begin{aligned} h_{b}\left( u,v \right) = \frac{\partial ^{2}E}{\partial e_{b}\left( u \right) \,\partial e_{b}\left( v \right) } = \frac{2}{N_{{\mathrm {v}}}}\sum _{p = 1}^{N_{{\mathrm {v}}}}{\sum _{i = 1}^{M}{\frac{\partial y_{p}\left( i \right) }{\partial e_{b}\left( u \right) } \frac{\partial y_{p}\left( i \right) }{\partial e_{b}\left( v \right) }}} \end{aligned}$$
(37)
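
For concreteness, the gradient of Eq. (36) and the Gauss–Newton Hessian of Eq. (37) can be assembled as in the following sketch. This is a minimal illustration, not the authors' implementation: it assumes weighted distances \(d_{k} = \sum _{v} b(v)\left( x_{p}(v) - m_{k}(v) \right) ^{2}\), activations \(\theta (k) = d_{k}^{-a}/D\) with \(D = \sum _{m} d_{m}^{-a}\), and precomputed per-cluster outputs \(y_{pk}(i)\); the array names X, T, centers, b, and Ypk are hypothetical.

```python
import numpy as np

def dist_grad_hessian(X, T, centers, b, a, Ypk):
    """Gradient (Eq. 36) and Gauss-Newton Hessian (Eq. 37) of the error
    with respect to the distance measure weights.
    X: (Nv, N) inputs, T: (Nv, M) targets, centers: (K, N),
    b: (N,) distance weights, Ypk: (Nv, K, M) per-cluster outputs."""
    Nv, N = X.shape
    g = np.zeros(N)
    H = np.zeros((N, N))
    for p in range(Nv):
        diff2 = (X[p] - centers) ** 2                # (K, N): (x_p(v) - m_k(v))^2
        d = diff2 @ b                                # (K,): weighted distances d_k
        dma = d ** (-a)                              # d_k^{-a}
        D = dma.sum()
        y = (dma / D) @ Ypk[p]                       # blended output y_p, shape (M,)
        # chain rule: d d_k^{-a} / d e_b(v) = -(a / d_k^{a+1}) (x_p(v) - m_k(v))^2
        ddma = -(a / d ** (a + 1))[:, None] * diff2  # (K, N)
        # quotient rule for theta(k) = d_k^{-a} / D
        dtheta = (D * ddma - np.outer(dma, ddma.sum(axis=0))) / D ** 2
        dy = dtheta.T @ Ypk[p]                       # (N, M): dy_p(i) / de_b(v)
        g += (-2.0 / Nv) * dy @ (T[p] - y)           # accumulate Eq. (36)
        H += (2.0 / Nv) * dy @ dy.T                  # accumulate Eq. (37)
    return g, H
```

A damped Newton step in the style of Levenberg [30] and Marquardt [37], e.g. solving \(\left( {\mathbf {H}}_{b} + \lambda {\mathbf {I}} \right) {\mathbf {e}}_{b} = -{\mathbf {g}}_{b}\), then yields the weight change vector.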

1.2 Calculations for Center Vector Optimization

The gradient of the SPLN error from Eq. (12) with respect to the \(u\)th cluster’s center vector element \(m_{u}(v)\) is calculated as:

$$\begin{aligned} g_{m}\left( u,v \right) = \frac{\partial E}{\partial m_{u}\left( v \right) } = - \frac{2}{N_{{\mathrm {v}}}}\sum _{p = 1}^{N_{{\mathrm {v}}}}{\sum _{i = 1}^{M}{\left[t_{p}\left( i \right) - y_{p}\left( i \right) \right]\frac{\partial y_{p}\left( i \right) }{\partial m_{u}(v)}}} \end{aligned}$$
(38)

where

$$\begin{aligned}&\frac{\partial y_{p}\left( i \right) }{\partial m_{u}(v)} = \sum _{k = 1}^{K}{\frac{\partial \theta \left( k \right) }{\partial m_{u}\left( v \right) } \ y_{\text {pk}}(i)}\\&\frac{\partial \theta \left( k \right) }{\partial m_{u}\left( v \right) } = \frac{\delta \left( u - k \right) \ D\frac{\partial d_{u}^{- a}}{\partial m_{u}\left( v \right) } - d_{k}^{- a}\frac{\partial d_{u}^{- a}}{\partial m_{u}\left( v \right) }}{D^{2}}\\&\frac{\partial d_{u}^{- a}}{\partial m_{u}\left( v \right) } = \frac{2 \; a}{d_{u}^{a + 1}} \ b(v) \left( x_{p}\left( v \right) - m_{u}\left( v \right) \right) \end{aligned}$$

The elements of the Gauss–Newton Hessian matrix \({\mathbf {H}}_{m}\) with respect to the learning factor elements \(z_{m}\) are given as

$$\begin{aligned} h_{m}\left( u,v \right) = \frac{\partial ^{2}E}{\partial z_{m}\left( u \right) \,\partial z_{m}\left( v \right) } = \frac{2}{N_{{\mathrm {v}}}}\sum _{p = 1}^{N_{{\mathrm {v}}}}{\sum _{i = 1}^{M}{\frac{\partial y_{p}\left( i \right) }{\partial z_{m}\left( u \right) } \frac{\partial y_{p}\left( i \right) }{\partial z_{m}\left( v \right) }}} \end{aligned}$$
(39)

and the gradient of the error with respect to the learning factor element \(z_{m}(u)\) is

$$\begin{aligned} g_{zm}\left( u \right) = \frac{\partial E}{\partial z_{m}\left( u \right) } = - \frac{2}{N_{{\mathrm {v}}}}\sum _{p = 1}^{N_{{\mathrm {v}}}}{\sum _{i = 1}^{M}{\left[t_{p}\left( i \right) - y_{p}\left( i \right) \right]\frac{\partial y_{p}\left( i \right) }{\partial z_{m}\left( u \right) }}} \end{aligned}$$
(40)

where

$$\begin{aligned}&\frac{\partial y_{p}\left( i \right) }{\partial z_{m}\left( u \right) } = \sum _{k = 1}^{K}{\frac{\partial \theta \left( k \right) }{\partial z_{m}\left( u \right) } y_{\text {pk}}(i)}\\&\frac{\partial \theta \left( k \right) }{\partial z_{m}(u)} = \frac{\delta \left( u - k \right) \, D\frac{\partial d_{u}^{- a}}{\partial z_{m}(u)} - d_{k}^{- a}\frac{\partial d_{u}^{- a}}{\partial z_{m}(u)}}{D^{2}}\\&\frac{\partial d_{u}^{- a}}{\partial z_{m}(u)} = \frac{2 \, a}{d_{u}^{a + 1}}\sum _{v = 1}^{N}{b(v) \left( x_{p}\left( v \right) - m_{u}\left( v \right) \right) g_{m}(u,v)} \end{aligned}$$
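
As a companion illustration, the center gradient \(g_{m}(u,v)\) of Eq. (38) can be computed by first contracting the error against the per-cluster outputs and then applying the quotient rule for \(\theta (k)\). This is again a hedged sketch under the same assumptions and hypothetical array names as the previous listing, not the authors' code.

```python
import numpy as np

def center_grad(X, T, centers, b, a, Ypk):
    """Gradient g_m(u, v) of Eq. (38) for every center element m_u(v).
    X: (Nv, N), T: (Nv, M), centers: (K, N), b: (N,), Ypk: (Nv, K, M)."""
    Nv, N = X.shape
    K = centers.shape[0]
    g = np.zeros((K, N))
    for p in range(Nv):
        diff = X[p] - centers                     # (K, N): x_p(v) - m_k(v)
        d = (diff ** 2) @ b                       # (K,): weighted distances d_k
        dma = d ** (-a)                           # d_k^{-a}
        D = dma.sum()
        y = (dma / D) @ Ypk[p]                    # blended output y_p, shape (M,)
        err = T[p] - y                            # (M,): t_p(i) - y_p(i)
        # d d_u^{-a} / d m_u(v) = (2a / d_u^{a+1}) b(v) (x_p(v) - m_u(v))
        ddma = (2 * a / d ** (a + 1))[:, None] * b[None, :] * diff
        # contract error with per-cluster outputs: s_k = sum_i err(i) y_pk(i)
        s = Ypk[p] @ err                          # (K,)
        # sum_k s_k dtheta(k)/dm_u(v) = ddma[u, v] (D s_u - dma.s) / D^2
        g += (-2.0 / Nv) * ddma * ((D * s - dma @ s) / D ** 2)[:, None]
    return g
```

The learning-factor system of Eqs. (39)–(40) can then be solved in the same damped Gauss–Newton fashion as before.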

1.3 Description of Datasets

(1), (2) Red and white wine quality data sets: These two data sets are related to red and white variants of the Portuguese “Vinho Verde” wine [12]. Due to privacy and logistical issues, only physicochemical (input) and sensory (output) variables are available; for example, there is no data about grape types, wine brand, or wine selling price.

(3) Twod data set: This training file is used in the task of inverting the surface scattering parameters from an inhomogeneous layer above a homogeneous half space, where both interfaces are randomly rough. The data file contains 2768 patterns, with eight inputs and seven outputs [14, 15].

(4) Three spirals data: This is a synthetic data set used in [8], consisting of three two-dimensional spirals, each labelled as a different class. It was converted to a regression problem by encoding the classes as binary outputs (a minimal encoding sketch follows this list). It is available at [47].

(5) Oh7 data: This data set is given in [41]. The training set contains VV and HH polarizations at L-band \(30^{\circ }\) and \(40^{\circ }\); C-band \(10^{\circ }\), \(30^{\circ }\), \(40^{\circ }\), \(50^{\circ }\), and \(60^{\circ }\); and X-band \(30^{\circ }\), \(40^{\circ }\), and \(50^{\circ }\), along with the corresponding unknowns: RMS surface height, surface correlation length, and volumetric soil moisture content in g/cm\(^{3}\). The file has 20 inputs, 3 outputs, and 10,453 training patterns.

(6) Melting point data: This data set comes from [26], where a robust and general model is developed for the prediction of melting points. It contains a diverse set of 4401 compounds with 202 descriptors that capture physicochemical and other graph-based molecular properties. It is included in the R package QSARdata [28].

(7) Concrete data: This data set is used to predict the compressive strength of high-performance concrete from its components and age. It comprises eight inputs, the first seven being the quantities of cement, slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate in kg/m\(^{3}\); the eighth input is age in days. The output variable is the compressive strength in MPa [56].
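
As mentioned for the three-spirals set above, converting a classification problem to regression amounts to one-hot (binary) encoding of the class labels. A minimal sketch with made-up labels, purely for illustration:

```python
import numpy as np

labels = np.array([0, 1, 2, 1, 0])   # hypothetical three-spiral class labels
targets = np.eye(3)[labels]          # (N, 3) binary regression targets
```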

Cite this article

Rawat, R., Manry, M.T. Second Order Training of a Smoothed Piecewise Linear Network. Neural Process Lett 46, 915–942 (2017). https://doi.org/10.1007/s11063-017-9618-2
