Abstract
Learning from data too large to fit in memory poses great challenges to currently available learning approaches. Averaged n-Dependence Estimators (AnDE) allows flexible learning from out-of-core data by varying the value of n (the number of super parents), and hence is especially appropriate for learning from large quantities of data. AnDE's memory requirement, however, grows combinatorially with the number of attributes and with n. When learning from large data, the number of attributes is often large, and a high n is also desirable to achieve low-bias classification. To obtain the lower bias of higher-n AnDE at reduced memory cost, we propose a memory-constrained selective AnDE algorithm that makes two passes through the training examples. The first pass selects the attributes to serve as super parents according to the available memory; the second learns an AnDE model whose parents are drawn only from the selected attributes. Extensive experiments show that the new selective AnDE has considerably lower bias and prediction error relative to A\(n'\)DE, where \(n' = n-1\), while maintaining the same space complexity and similar time complexity. The proposed algorithm works well on categorical data; numerical data sets must be discretized first.
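For reference, AnDE estimates class probabilities by averaging over all size-\(n\) attribute subsets used as super parents; the selective variant described above restricts this average to subsets drawn from the attributes chosen in the first pass. A sketch of the standard estimate, with \(a\) attributes, class \(y\), and \(x_s\) denoting the values \(\mathbf{x}\) takes on subset \(s\) (\(n=0\) recovers naive Bayes, \(n=1\) recovers AODE):

\[
\hat{P}(y \mid \mathbf{x}) \;\propto\; \sum_{s \in \binom{\{1,\dots,a\}}{n}} \hat{P}(y, x_s) \prod_{i=1}^{a} \hat{P}(x_i \mid y, x_s)
\]

The two-pass procedure itself might be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the mutual-information ranking, the subset-count budget, and all function names are hypothetical stand-ins for the paper's actual selection criterion and memory model.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import comb, log

def mutual_information(column, labels):
    # Empirical I(X; Y) in nats between one attribute column and the class.
    n = len(labels)
    cx, cy = Counter(column), Counter(labels)
    cxy = Counter(zip(column, labels))
    return sum((c / n) * log(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())

def select_super_parents(X, y, n, max_tables):
    # Pass 1: rank attributes by mutual information with the class (an
    # assumed criterion) and keep the longest prefix of the ranking whose
    # number of size-n super-parent subsets stays within the budget.
    a = len(X[0])
    ranked = sorted(range(a),
                    key=lambda j: mutual_information([row[j] for row in X], y),
                    reverse=True)
    selected = []
    for j in ranked:
        if comb(len(selected) + 1, n) > max_tables:
            break
        selected.append(j)
    return selected

def fit_counts(X, y, selected, n):
    # Pass 2: accumulate the joint counts AnDE needs, i.e. counts of
    # (y, x_s, x_i) for each size-n subset s of the selected attributes;
    # P(y, x_s) is then obtained by marginalising over (i, x_i).
    tables = defaultdict(Counter)
    for row, label in zip(X, y):
        for s in combinations(selected, n):
            xs = tuple(row[j] for j in s)
            for i, xi in enumerate(row):
                tables[s][(label, xs, i, xi)] += 1
    return tables
```

A real out-of-core implementation would stream both passes from disk and express the budget in bytes over the observed value cardinalities (each table grows roughly as the product of its super parents' cardinalities); the subset-count cap above is only a stand-in.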

Acknowledgments
This research has been supported by the Australian Research Council under Grant DP140100087; the Asian Office of Aerospace Research and Development, Air Force Office of Scientific Research, under Contract FA23861214030; the National Natural Science Foundation of China under Grants 61202135 and 61272209; the Natural Science Foundation of Jiangsu, China, under Grant BK20130735; the Natural Science Foundation of Jiangsu Higher Education Institutions of China under Grants 14KJB520019, 13KJB520011, and 13KJB520013; the open project program of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University; and the Priority Academic Program Development of Jiangsu Higher Education Institutions. This research has also been supported in part by the Monash e-Research Center and eSolutions-Research Support Services through the use of the Monash Campus HPC Cluster and the LIEF Grant. This research was also undertaken on the NCI National Facility in Canberra, Australia, which is supported by the Australian Commonwealth Government.
Appendices
Appendix 1: Table of RMSE
See Table 8.
Appendix 2: Table of zero-one loss
See Table 9.
Appendix 3: Table of bias and variance
See Table 10.
Appendix 4: Table of computing time
See Table 11.
Cite this article
Chen, S., Martínez, A.M., Webb, G.I. et al. Selective AnDE for large data learning: a low-bias memory constrained approach. Knowl Inf Syst 50, 475–503 (2017). https://doi.org/10.1007/s10115-016-0937-9