Distributed Gaussian Mixture Model Summarization Using the MapReduce Framework

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9673)

Abstract

As the rate of data generation accelerates, scalable techniques for handling large datasets become essential. One promising avenue is distributed storage and processing. Another is data summarization, in which a subset of the data provides an approximate yet useful representation of the whole. Combining the two allows data summarization to be implemented in a distributed fashion. In this paper, we achieve this by proposing and implementing a distributed Gaussian Mixture Model Summarization using the MapReduce framework (MR-SGMM). In MR-SGMM, we partition the input data, cluster the data within each partition using the density-based clustering algorithm DBSCAN, and, for every cluster, discover the SGMM core points and their features. We test the implementation on synthetic and real datasets to demonstrate its validity and efficiency. This paves the way for a scalable implementation of Summarization using Gaussian Mixture Model (SGMM).
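
The partition, cluster, and summarize steps named in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: it runs in a single process rather than on a MapReduce cluster, it splits the data into equal-width strips along the first coordinate, and it replaces SGMM core-point discovery (which the abstract does not detail) with a single diagonal Gaussian per cluster. The function names `partition`, `dbscan`, `summarize`, `map_partition`, and `mr_summarize` are hypothetical.

```python
import math
from collections import defaultdict


def partition(points, n_parts):
    """Assumed scheme: equal-width strips along the first coordinate."""
    lo = min(p[0] for p in points)
    hi = max(p[0] for p in points)
    width = (hi - lo) / n_parts or 1.0  # avoid a zero width
    parts = [[] for _ in range(n_parts)]
    for p in points:
        idx = min(int((p[0] - lo) / width), n_parts - 1)
        parts[idx].append(p)
    return parts


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one label per point (-1 is noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # core point: keep expanding
                queue.extend(nb)
    return labels


def summarize(cluster):
    """Stand-in summary: one diagonal Gaussian (not SGMM core points)."""
    n = len(cluster)
    dim = len(cluster[0])
    mean = [sum(p[d] for p in cluster) / n for d in range(dim)]
    var = [sum((p[d] - mean[d]) ** 2 for p in cluster) / n
           for d in range(dim)]
    return {"weight": n, "mean": mean, "var": var}


def map_partition(points, eps, min_pts):
    """Map step: cluster one partition and summarize each cluster."""
    clusters = defaultdict(list)
    for p, label in zip(points, dbscan(points, eps, min_pts)):
        if label != -1:
            clusters[label].append(p)
    return [summarize(c) for c in clusters.values()]


def mr_summarize(points, n_parts, eps, min_pts):
    """Reduce step (simplified): concatenate the per-partition summaries."""
    summaries = []
    for part in partition(points, n_parts):
        if part:
            summaries.extend(map_partition(part, eps, min_pts))
    return summaries
```

Calling `mr_summarize(points, n_parts, eps, min_pts)` on a list of coordinate tuples returns one summary dictionary per cluster found in each partition. Clusters that straddle a partition boundary are not merged in this sketch; a real distributed implementation would need to handle that case.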



Acknowledgement

This research was partially supported by Mitacs.

Corresponding author

Correspondence to Arina Esmaeilpour.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Esmaeilpour, A., Bigdeli, E., Cheraghchi, F., Raahemi, B., Far, B.H. (2016). Distributed Gaussian Mixture Model Summarization Using the MapReduce Framework. In: Khoury, R., Drummond, C. (eds) Advances in Artificial Intelligence. Canadian AI 2016. Lecture Notes in Computer Science, vol 9673. Springer, Cham. https://doi.org/10.1007/978-3-319-34111-8_39

  • DOI: https://doi.org/10.1007/978-3-319-34111-8_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-34110-1

  • Online ISBN: 978-3-319-34111-8

  • eBook Packages: Computer Science, Computer Science (R0)
