Abstract
Local learning algorithms use a neighborhood of training data close to a given testing query point to learn the local parameters and create on-the-fly a local model specifically designed for this query point. The local approach delivers breakthrough performance in many application domains. This paper considers local learning versions of regularization networks (RN) and investigates several options for improving their online prediction performance, in both accuracy and speed. First, we exploit the interplay between locally optimized and globally optimized hyper-parameters (the regularization parameter and the kernel width) that each new predictor needs to optimize online. The operational cost is substantially reduced when the two hyper-parameters are globally optimized and shared by all local models. We also demonstrate that this global optimization of the two hyper-parameters produces more accurate models than the alternatives that locally optimize online either the regularization parameter, or the kernel width, or both. Then, by comparing eigenvalue decomposition (EVD) with Cholesky decomposition specifically for the local learning training and testing phases, we reveal that the Cholesky-based implementations are faster than their EVD counterparts in all training cases. While EVD is suitable for validating several regularization parameters cost-effectively, Cholesky should be preferred when validating several neighborhood sizes (the number of k-nearest neighbors) as well as when the local network operates online. We then exploit parallelism in a multi-core system for these local computations, demonstrating that the execution times are further reduced. Finally, although using pre-computed stored local models instead of online learning local models is even faster, this option degrades predictive accuracy.
Evidently, there is a substantial gain in waiting for a testing point to arrive before building a local model, and hence the online local learning RNs are more accurate than their pre-computed stored counterparts. To support all these findings, we present extensive experimental results and comparisons on several benchmark datasets.
Acknowledgments
We gratefully acknowledge the useful comments and suggestions of the anonymous reviewers, which helped improve the presentation and clarity of this paper.
Appendix
For all the algorithms, we cache intermediate results extensively to speed up the process. The training phase of each case must also find the best number of neighbors, denoted k_best. We search for the best k in the grid {δL, 2δL, …, L_max}, where L_max is the maximum candidate number of neighbors. A local distance matrix caches the distances between the neighbor points. From this matrix, a cached local kernel matrix is created once for every candidate σ_m value. Thus, for each (L_max, σ_m, λ_l) combination only one Cholesky factorization of the kernel matrix is computed. Cholesky back substitution then progressively solves for the local weights of all the candidate k values. All four local RN cases use the minimum global training errors to find the best global parameters.
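One factorization can serve every candidate k because the Cholesky factor of the leading k × k block of the regularized kernel matrix is exactly the leading k × k block of the full factor. The sketch below is a hypothetical NumPy rendering of this idea (the Gaussian kernel form, the function name, and all interfaces are assumptions, not taken from the paper): it factors the full L_max-sized local system once and reuses leading blocks to solve for all candidate neighborhood sizes.

```python
import numpy as np

def validate_k_candidates(X_nb, y_nb, sigma, lam, k_grid):
    """Solve the local RN system (K + lam*I) w = y for several candidate
    neighborhood sizes k with a single Cholesky factorization.

    X_nb, y_nb: the L_max nearest neighbors, sorted by distance to the query.
    """
    # Cached pairwise squared distances between the neighbor points
    sq = np.sum(X_nb**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X_nb @ X_nb.T
    K = np.exp(-D2 / (2.0 * sigma**2))          # cached local kernel matrix
    # One factorization for this (L_max, sigma, lam) combination
    L = np.linalg.cholesky(K + lam * np.eye(len(X_nb)))
    weights = {}
    for k in k_grid:
        # Leading k x k block of L is the Cholesky factor of the k x k system
        Lk = L[:k, :k]
        w = np.linalg.solve(Lk, y_nb[:k])       # forward substitution
        weights[k] = np.linalg.solve(Lk.T, w)   # back substitution
    return weights
```

Each candidate k then costs only two triangular solves instead of a fresh O(k^3) factorization, which is the saving the caching scheme above relies on.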
The training phase contains three nested loops: one iterates through the candidate width values σ_m, one through the candidate k-neighbor values, and one through the candidate regularization values λ_l. The ordering of these loops matters. For the best and fastest ordering in the EVD implementations, the loop over the widths σ_m must come first, followed by the loop over the candidate k-neighbor values, with the innermost loop validating the candidate regularization values λ_l. In the Cholesky implementations, the fastest ordering again places the loop over the widths σ_m first, but the second loop must iterate through the candidate regularization values λ_l, and the loop over the candidate k-neighbor values must come third.
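The EVD ordering puts λ_l innermost because one eigendecomposition K = V diag(s) V^T of the local kernel matrix (fixed σ_m and k) makes every candidate regularization value nearly free: the local weights are w = V diag(1/(s + λ)) V^T y. A minimal sketch of this inner loop, with hypothetical names not taken from the paper:

```python
import numpy as np

def evd_weights_for_lambdas(K, y, lam_grid):
    """Given one local kernel matrix K (for a fixed width and neighborhood
    size), return the local RN weights for every candidate lambda using a
    single eigendecomposition, since (K + lam*I)^-1 y = V diag(1/(s+lam)) V^T y.
    """
    s, V = np.linalg.eigh(K)   # one EVD per (sigma, k) pair
    Vty = V.T @ y              # shared across all lambda candidates
    return {lam: V @ (Vty / (s + lam)) for lam in lam_grid}
```

Each extra λ candidate then costs only an O(k^2) matrix-vector product, which is why EVD suits validating many regularization values, while the progressive-block property of Cholesky suits validating many neighborhood sizes.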
Kokkinos, Y., Margaritis, K.G. Local learning regularization networks for localized regression. Neural Comput & Applic 28, 1309–1328 (2017). https://doi.org/10.1007/s00521-016-2569-0