Speeding up the scaled conjugate gradient algorithm and its application in neuro-fuzzy classifier training

Abstract

The aim of this study is to speed up the scaled conjugate gradient (SCG) algorithm by shortening the training time per iteration. The SCG algorithm, a supervised learning algorithm for network-based methods, is generally used to solve large-scale problems. It is well known that SCG computes second-order information from two first-order gradients of the parameters, using the entire training dataset. Consequently, the per-iteration computation cost of SCG becomes expensive for large-scale problems. In this study, one of the first-order gradients is estimated from previously calculated gradients without using the training dataset. A least squares error (LSE) estimator is applied to estimate this gradient. For large-scale problems, the estimation complexity of the gradient is much smaller than its computation complexity, because the estimation is independent of the size of the dataset. The proposed algorithm is applied to neuro-fuzzy classifier and neural network training. The theoretical basis of the algorithm is provided, and its performance is illustrated on several well-known datasets, where it is compared with several training algorithms. The empirical results indicate that the proposed algorithm requires less time per iteration than SCG: it decreases the training time by 20–50% compared with SCG, while its convergence rate remains similar to that of SCG.

Acknowledgments

The authors thank Rifat Edizkan, Omer Nezih Gerek, and the reviewers for all the useful discussions and their valuable comments on this article.

Author information

Corresponding author

Correspondence to Bayram Cetişli.

Appendix: The presentation of the complexity of the gradient estimation

Our claim is that the complexity of the gradient estimation \( g_{t,k}^{est} \) is less than the complexity of the gradient calculation \( g_{t,k}^{calc} \), that is, \( O\left( g_{t,k}^{est} \right) < O\left( g_{t,k}^{calc} \right) \), where O(·) denotes the computational complexity.

This can be shown as follows:

Both \( g_{t,k}^{est} \) and \( g_{t,k}^{calc} \) represent the gradient of E with respect to \( \rho_{ij} \).

First, the number of operations required for \( g_{t,k}^{est} \) is counted.

$$ \text{If}\quad \mathbf{A} = \left[ \begin{array}{ccc} \theta_{k-2}^{2} & \theta_{k-2} & 1 \\ \theta_{k-1}^{2} & \theta_{k-1} & 1 \\ \theta_{k}^{2} & \theta_{k} & 1 \end{array} \right] = \left[ \begin{array}{c} \boldsymbol{\Theta}_{k-2} \\ \boldsymbol{\Theta}_{k-1} \\ \boldsymbol{\Theta}_{k} \end{array} \right], \quad \text{then}\quad O(\mathbf{A}) = 3\,mult $$
(9)

In Eq. 9, the term “mult” denotes the multiplication operation.

$$ {\text{If}}\,{\mathbf{F}} = \left[ {\begin{array}{*{20}c} {f_{1} } \\ {f_{2} } \\ {f_{3} } \\ \end{array} } \right]\quad {\text{and}}\quad {\mathbf{G}} = \left[ {\begin{array}{*{20}c} {g_{k - 2} } \\ {g_{k - 1} } \\ {g_{k} } \\ \end{array} } \right] \Rightarrow {\mathbf{AF}} = {\mathbf{G}} \Rightarrow {\mathbf{F}} = {\mathbf{A}}^{ - 1} {\mathbf{G}}. $$
(10)

\( g_{t,k}^{est} \) is estimated by LSE as shown below:

$$ g_{t,k}^{est} = \boldsymbol{\Theta}_{t,k} \mathbf{F} = f_{1}\,\theta_{t,k}^{2} + f_{2}\,\theta_{t,k} + f_{3}. $$
(11)

If \( {\mathbf{A}} = \left[ {\begin{array}{*{20}c} {a_{11} } & {a_{12} } & {a_{13} } \\ {a_{21} } & {a_{22} } & {a_{23} } \\ {a_{31} } & {a_{32} } & {a_{33} } \\ \end{array} } \right] , \) then the determinant of A is calculated as follows:

$$ \left| {\mathbf{A}} \right| = a_{11} a_{22} a_{33} + a_{12} a_{23} a_{31} + a_{13} a_{21} a_{32} - a_{11} a_{23} a_{32} - a_{12} a_{21} a_{33} - a_{13} a_{22} a_{31} . $$
(12)

In addition, \( O\left( \left| \mathbf{A} \right| \right) = 5\,add + 2 \times 6\,mult \), where the term “add” denotes the addition operation. The adjoint of A is given below:

$$ adj\left( \mathbf{A} \right) = \left\{ \begin{array}{l} \mathbf{A}_{11} = \left( -1 \right)^{1+1} \left| \begin{array}{cc} a_{22} & a_{23} \\ a_{32} & a_{33} \end{array} \right| = a_{22} a_{33} - a_{23} a_{32} \\ \qquad \vdots \\ \mathbf{A}_{33} = \left( -1 \right)^{3+3} \left| \begin{array}{cc} a_{11} & a_{12} \\ a_{21} & a_{22} \end{array} \right| = a_{11} a_{22} - a_{21} a_{12} \end{array} \right. $$
(13)

The operation size of adj(A) is \( O\left( adj\left( \mathbf{A} \right) \right) = 9 \times \left( 1\,add + 2\,mult \right) \). Now, \( \mathbf{A}^{-1} \) and its operation size can be determined as:

$$ {\mathbf{A}}^{ - 1} = \frac{1}{{\det \left( {\mathbf{A}} \right)}}adj\left( {\mathbf{A}} \right) = {\mathbf{B}},\quad {\text{and}}\quad O\left( {{\mathbf{A}}^{ - 1} } \right) = 14\,add + 39\,mult. $$
(14)
$$ {\mathbf{F}} = {\mathbf{A}}^{ - 1} {\mathbf{G}} = \left[ {\begin{array}{*{20}c} {b_{11} g_{k - 2} + b_{12} g_{k - 1} + b_{13} g_{k} } \\ {b_{21} g_{k - 2} + b_{22} g_{k - 1} + b_{23} g_{k} } \\ {b_{31} g_{k - 2} + b_{32} g_{k - 1} + b_{33} g_{k} } \\ \end{array} } \right]\quad {\text{and}}\quad O\left( {\mathbf{F}} \right) = 6\,add + 9\,mult. $$
(15)

Lastly, \( O\left( g_{t,k}^{est} = \boldsymbol{\Theta}_{t,k} \mathbf{F} \right) = 2\,add + 2\,mult \), so the total cost of the estimation is \( 22\,add + 53\,mult \).
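
For concreteness, a minimal NumPy sketch of the estimator in Eqs. 9–11 is given below. The dense solve stands in for the explicit inverse of Eqs. 12–15, and the function name and the toy parameter history are illustrative assumptions, not the original implementation.

    import numpy as np

    def estimate_gradient(theta_hist, grad_hist, theta_now):
        # Fit g(theta) = f1*theta^2 + f2*theta + f3 through the three most
        # recent (theta, gradient) pairs (Eqs. 9-10) and evaluate the fitted
        # quadratic at the current parameter value (Eq. 11).
        A = np.column_stack([np.square(theta_hist), theta_hist, np.ones_like(theta_hist)])
        f1, f2, f3 = np.linalg.solve(A, grad_hist)   # F = A^{-1} G
        return f1 * theta_now**2 + f2 * theta_now + f3

    # Toy usage: three past values of one parameter and their calculated gradients.
    theta_hist = np.array([0.80, 0.85, 0.90])
    grad_hist = np.array([-0.42, -0.35, -0.29])
    print(estimate_gradient(theta_hist, grad_hist, 0.93))

Note that no training samples appear anywhere in this computation, which is why its cost is independent of the dataset size.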

On the other hand, the calculation of \( g_{t,k}^{calc} \) and its number of arithmetic operations are determined below:

First, the output of the NFC is calculated, starting from its first layer:

$$ \mu_{ij} = \exp \left( { - 0.5\frac{{\left( {x_{pj} - \rho_{ij} } \right)^{2} }}{{\sigma_{ij}^{2} }}} \right), $$
(16)

where \( \rho_{ij}\,\left( \rho_{ij} \in \boldsymbol{\Gamma}_{m \times n} \right) \) and \( \sigma_{ij}\,\left( \sigma_{ij} \in \boldsymbol{\Lambda}_{m \times n} \right) \) are the centre and the width of the Gaussian membership function \( \mu_{ij} \) for the ith rule and the jth feature. In Eq. 16, the exp(·) operation is approximately calculated using the Maclaurin series:

$$ \exp \left( z \right) = \sum\limits_{u = 0}^{\infty } {\frac{{z^{u} }}{u!}} = 1 + z + \frac{{z^{2} }}{2!} + \cdots + \frac{{z^{u} }}{u!} + \cdots $$
(17)
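
In the operation count this series is truncated. A minimal sketch of such a truncated evaluation is shown below; the incremental form builds each term from the previous one and therefore uses far fewer multiplications than the per-term count of 117 assumed in the text, so it only illustrates the truncation itself.

    import math

    def exp_maclaurin(z, u_max=10):
        # Truncated Maclaurin series for exp(z) (Eq. 17), stopped at u = u_max.
        term, total = 1.0, 1.0
        for u in range(1, u_max + 1):
            term *= z / u        # z^u / u! built incrementally from the previous term
            total += term
        return total

    print(exp_maclaurin(-0.5), math.exp(-0.5))   # truncated series vs. library value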

If the series is truncated at the upper limit u = 10, then \( O\left( \exp (z) \right) = 10\,add + 117\,mult \) and \( O\left( \mu_{ij} \right) = 11\,add + 123\,mult \). The output of the second layer is calculated as

$$ rule_{i} = \prod\limits_{j = 1}^{n} {\mu_{ij} } ,\quad {\text{and}}\quad O\left( {rule_{i} } \right) = n \times \left( {11\,add + 123\,mult} \right), $$
(18)

where \( rule_{i} \) represents the firing strength of the ith rule. The output of the third layer is calculated as:

$$ o_{k} = \sum\limits_{i = 1}^{m} {rule_{i} w_{ik} } ,\quad {\text{and}}\quad O(o_{k} ) = m \times n \times \left( {11\,add + 123\,mult} \right) + m \times \left( {add + mult} \right), $$
(19)

where \( w_{ik} \,\left( {w_{ik} \in {\mathbf{W}}_{m \times c} } \right) \) represents the weight connecting the ith rule to the kth class, and m denotes the number of rules. The output of the last layer is calculated as

$$ \begin{aligned} out_{k} & = \frac{o_{k}}{\sum\nolimits_{l=1}^{c} o_{l}} = \frac{o_{k}}{T}, \quad T = \sum\limits_{l=1}^{c} o_{l}, \quad \text{and} \\ O\left( out_{k} \right) & = m \times n \times \left( 11\,add + 123\,mult \right) + \left( m + c \right) add + \left( m + 1 \right) mult. \\ \end{aligned} $$
(20)
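
The forward pass of Eqs. 16–20 can be summarised in a few lines. The sketch below is an illustrative NumPy version for a single sample; the function name and the array layout (rho and sigma of size m×n, W of size m×c) are assumptions for the example, not the authors' implementation.

    import numpy as np

    def nfc_forward(x, rho, sigma, W):
        # x: one sample with n features; rho, sigma: (m, n) centres and widths;
        # W: (m, c) rule-to-class weights.
        mu = np.exp(-0.5 * ((x - rho) / sigma) ** 2)   # Eq. 16, membership degrees
        rule = np.prod(mu, axis=1)                     # Eq. 18, rule firing strengths
        o = rule @ W                                   # Eq. 19, weighted class sums
        return o / np.sum(o)                           # Eq. 20, normalised class outputs

    # Toy usage with m = 2 rules, n = 3 features, c = 2 classes.
    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    rho, sigma = rng.normal(size=(2, 3)), np.ones((2, 3))
    W = rng.random((2, 2))
    print(nfc_forward(x, rho, sigma, W))               # outputs sum to 1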

The mean square error function is used as the cost function:

$$ E = \frac{1}{N}\sum\limits_{p = 1}^{N} {E^{p} } ,\quad E^{p} = \frac{1}{2}\sum\limits_{k = 1}^{c} {\left( {y_{pk} - out_{pk} } \right)^{2} } , $$
(21)

where N represents the number of samples; c represents the number of classes; and \( y_{pk} \) and \( out_{pk} \) are the target and the actual output for the pth sample and the kth class, respectively. \( g_{t,k}^{calc} \) of \( \rho_{ij} \) is calculated using the chain rule:

$$ \frac{\partial E}{\partial \rho_{ij}} = \frac{1}{N}\sum\limits_{p=1}^{N} \sum\limits_{k=1}^{c} \left( out_{pk} - y_{pk} \right) \left( \frac{1 - out_{pk}}{T} \right) w_{ik}\, rule_{i} \left( \frac{x_{pj} - \rho_{ij}}{\sigma_{ij}^{2}} \right). $$
(22)
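
A direct, unoptimised transcription of Eq. 22 makes the dependence on the dataset size explicit: every evaluation loops over all N samples and c classes. The sketch below reuses the hypothetical array layout of the forward-pass example above and is meant only as an illustration.

    import numpy as np

    def grad_rho_exact(X, Y, rho, sigma, W, i, j):
        # dE/d(rho_ij) over the whole training set, as in Eq. 22.
        # X: (N, n) samples, Y: (N, c) targets.
        N, c = Y.shape
        g = 0.0
        for p in range(N):
            mu = np.exp(-0.5 * ((X[p] - rho) / sigma) ** 2)
            rule = np.prod(mu, axis=1)
            o = rule @ W
            T = np.sum(o)
            out = o / T
            for k in range(c):
                g += ((out[k] - Y[p, k]) * ((1.0 - out[k]) / T) * W[i, k]
                      * rule[i] * (X[p, j] - rho[i, j]) / sigma[i, j] ** 2)
        return g / N

In contrast, the estimator of Eqs. 9–11 touches no training samples at all, which is the source of the per-iteration saving.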

The total arithmetic operations of \( g_{t,k}^{calc} \) of \( \rho_{ij} \) are given below:

$$ \begin{aligned} O\left( g_{t,k}^{calc} \right) & = c \times N \times \left( m \times n \times \left( 11\,add + 123\,mult \right) + \left( m + c \right) add + \left( m + 1 \right) mult \right) \\ & \quad + c \times N \times \left( 3\,add + 9\,mult \right) + 2\left( c \times N \right) add + N \times n \times \left( 11\,add + 123\,mult \right) \\ O\left( g_{t,k}^{calc} \right) & \approx \left( c \times N \times \left( 5 + 11 \times m \times n \right) + 11 \times N \times n \right) add \\ & \quad + \left( c \times N \times \left( 9 + 123 \times m \times n \right) + 123 \times N \times n \right) mult \\ \end{aligned} $$
(23)

The comparison of the number of arithmetic operations of the gradients is given in Eq. 24.

$$ \begin{aligned} & 22\,add + 53\,mult \ll N \times n \times \left( \left( c \times \left( \frac{5}{n} + 11 \times m \right) + 11 \right) add + \left( c \times \left( \frac{9}{n} + 123 \times m \right) + 123 \right) mult \right) \\ & \text{If}\ N > 100,\ m > 1,\ n > 1\ \text{and}\ c > 1,\ \text{then} \\ & O\left( g_{t,k}^{est} \right) \ll O\left( g_{t,k}^{calc} \right) \\ \end{aligned} $$
(24)
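
To make the gap concrete, both counts can be evaluated numerically. The small sketch below sums the terms of Eq. 23 directly and compares them with the fixed cost of the estimation; the problem sizes N = 1000, m = 5, n = 4 and c = 3 are illustrative only.

    def ops_estimated():
        # Total cost of the estimated gradient (the 22 add + 53 mult given after Eq. 15).
        return 22, 53                      # (additions, multiplications)

    def ops_calculated(N, m, n, c):
        # Cost of the calculated gradient, summed term by term from Eq. 23.
        add = c * N * (11 * m * n + (m + c)) + 3 * c * N + 2 * c * N + 11 * N * n
        mult = c * N * (123 * m * n + (m + 1)) + 9 * c * N + 123 * N * n
        return add, mult

    print(ops_estimated())                 # (22, 53)
    print(ops_calculated(1000, 5, 4, 3))   # already millions of operations per gradient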

Cite this article

Cetişli, B., Barkana, A. Speeding up the scaled conjugate gradient algorithm and its application in neuro-fuzzy classifier training. Soft Comput 14, 365–378 (2010). https://doi.org/10.1007/s00500-009-0410-8
