Bayesian Network Structure Learning from Big Data: A Reservoir Sampling Based Ensemble Method

Tang, Yan; Xu, Zhuoming; Zhuang, Yuanhang

doi:10.1007/978-3-319-32055-7_18

Yan Tang¹⁶,
Zhuoming Xu¹⁶ &
Yuanhang Zhuang¹⁶

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9645))

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

1738 Accesses
4 Citations

Abstract

Bayesian network (BN) learning from big datasets is potentially more valuable than learning from conventional small datasets as big data contain more comprehensive probability distributions and richer causal relationships. However, learning BNs from big datasets requires high computational cost and easily ends in failure, especially when the learning task is performed on a conventional computation platform. This paper addresses the issue of BN structure learning from a big dataset on a conventional computation platform, and proposes a reservoir sampling based ensemble method (RSEM). In RSEM, a greedy algorithm is used to determine an appropriate size of sub datasets to be extracted from the big dataset. A fast reservoir sampling method is then adopted to efficiently extract sub datasets in one pass. Lastly, a weighted adjacent matrix based ensemble method is employed to produce the final BN structure. Experimental results on both synthetic and real-world big datasets show that RSEM can perform BN structure learning in an accurate and efficient way.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Ben-Gal, I.: Bayesian Networks. Encyclopedia of Statistics in Quality and Reliability. Wiley, New York (2007)
Google Scholar
Zhang, Y., Zhang, Y., Swears, N., et al.: Modeling temporal interactions with interval temporal bayesian networks for complex activity recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(10), 2468–2483 (2013)
Article Google Scholar
Fenton, N.E., Neil, M.: A critique of software defect prediction models. IEEE Trans. Softw. Eng. 25(5), 675–689 (1999)
Article Google Scholar
Sun, S., Zhang, C., Yu, G.: A bayesian network approach to traffic flow forecasting. IEEE Trans. Intell. Trans. Syst. 7(1), 124–132 (2006)
Article Google Scholar
Al-Jarrah, O., Yoo, P., et al.: Efficient machine learning for big data: A review. Big Data Res. 2(3), 87–93 (2015)
Article Google Scholar
Fang, Q., Yue, K., Fu, X., Wu, H., Liu, W.: A mapreduce-based method for learning bayesian network from massive data. In: Ishikawa, Y., Li, J., Wang, W., Zhang, R., Zhang, W. (eds.) APWeb 2013. LNCS, vol. 7808, pp. 697–708. Springer, Heidelberg (2013)
Chapter Google Scholar
Wang, J., Tang, Y., Nguyen, M., Altintas, I.: A scalable data science workflow ap-proach for big data bayesian network learning. In: Proceedings of the 2014 IEEE/ACM International Symposium on Big Data Computing (BDC 2014), pp. 16–25 (2014)
Google Scholar
Cheng, J., Greiner, R., Kelly, J., Bell, D., Liu, W.: Learning bayesian networks from data: An information-theory based approach. Artif. Intell. 137(1–2), 43–90 (2002)
Article MathSciNet MATH Google Scholar
Heckerman, D., Geiger, D., Chickering, D.: Learning bayesian networks: The combination of knowledge and statistical data. Mach. Learn. 20, 197–243 (1995)
MATH Google Scholar
Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Series in Representation and Reasoning. Morgan Kaufmann, San Mateo (1988)
MATH Google Scholar
Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing bayesian network structure learning algorithm. Mach. Learn. 65(1), 31–78 (2006)
Article Google Scholar
Jiang, L., Li, C., Cai, Z., Zhang, H.: Sampled bayesian network classifiers for class-imbalance and cost-sensitive learning. In: Proceedings of the IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 512–517 (2013)
Google Scholar
Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985)
Article MathSciNet MATH Google Scholar
Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1–2), 1–39 (2010)
Article Google Scholar
Hasna, N.J.S.: Weighted ensemble learning of bayesian network for gene regulatory networks. Neurocomputing 150((B)), 404–416 (2015)
Google Scholar
Tang, Y., Wang, Y., Cooper, K., Li, L.: Towards big data bayesian network learning - an ensemble learning based approach. In: Proceedings of the IEEE International Congress on Big Data (BigData Congress), pp. 355–357 (2014)
Google Scholar
Chickering, D., Heckerman, D., Meek, C.: Large-sample learning of bayesian networks is np-hard. J. Mach. Learn. Res. 5, 1287–1330 (2004)
MathSciNet MATH Google Scholar
Yoo, C., Ramirez, L., Liuzzi, J.: Big data analysis using modern statistical and machine learning methods in medicine. Int. Neurourol. J. 18(2), 50–57 (2014)
Article Google Scholar
Scutari, M.: Learning bayesian networks with the bnlearn r package. J. Statist. Softw. 35(3), 1–22 (2010)
Article MathSciNet Google Scholar
Spiegelhalter, D., Cowell, R.: Learning in probabilistic expert systems. Bayesian Statistics, 4. Clarendon Press, Oxford (1992)
Google Scholar
Beinlich, I., Suermondt, H., Chavez, R., Cooper, G.: The alarm monitoring system: A case study with two probabilistic inference techniques for belief networks. In: Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pp. 247–256 (1989)
Google Scholar
Onisko, A.: Probabilistic Causal Models in Medicine: Application to Diagnosis of Liver Disorders. Ph.D. thesis, Institute of Biocybernetics and Biomedical Engineering, Polish Academy of Science, Warsaw (2003)
Google Scholar
Data.gov - the U.S. Government Open Data: 2009 Home Mortgage Disclosure act (HMDA) Loan Application Register (LAR) Data, Accessed December 15, 2015. http://catalog.data.gov/dataset/2009-home-mortgage-disclosure-act-hmda-loan-application-register-lar-data

Download references

Acknowledgments

This work was supported by the Natural Science Foundation of Jiangsu Province, China (Grant No. BK20141420 and Grant No. BK20140857) and the “Six Talent Peaks Program” of Jiangsu Province, China (Grant No. 2008135).

Author information

Authors and Affiliations

College of Computer and Information, Hohai University, Nanjing, 210098, China
Yan Tang, Zhuoming Xu & Yuanhang Zhuang

Authors

Yan Tang
View author publications
You can also search for this author in PubMed Google Scholar
Zhuoming Xu
View author publications
You can also search for this author in PubMed Google Scholar
Yuanhang Zhuang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Zhuoming Xu .

Editor information

Editors and Affiliations

Harbin Institute of Technology, Harbin, China
Hong Gao
Kangwon National University, Kangwon, Korea (Republic of)
Jinho Kim
Kumamoto University, Kumamoto-shi, Japan
Yasushi Sakurai

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Tang, Y., Xu, Z., Zhuang, Y. (2016). Bayesian Network Structure Learning from Big Data: A Reservoir Sampling Based Ensemble Method. In: Gao, H., Kim, J., Sakurai, Y. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science(), vol 9645. Springer, Cham. https://doi.org/10.1007/978-3-319-32055-7_18

Download citation

DOI: https://doi.org/10.1007/978-3-319-32055-7_18
Published: 12 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32054-0
Online ISBN: 978-3-319-32055-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics