Distributed Gaussian Mixture Model Summarization Using the MapReduce Framework

Esmaeilpour, Arina; Bigdeli, Elnaz; Cheraghchi, Fatemeh; Raahemi, Bijan; Far, Behrouz H.

doi:10.1007/978-3-319-34111-8_39

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9673)

Included in the following conference series:

Canadian Conference on Artificial Intelligence

Abstract

With an accelerating rate of data generation, sophisticated techniques are essential to meet scalability requirements. One promising avenue for handling large datasets is distributed storage and processing. Data summarization is another useful concept for managing large datasets: a subset of the data provides an approximate yet useful representation. Combining the two allows a distributed implementation of data summarization. In this paper, we achieve this by proposing and implementing a distributed Gaussian Mixture Model Summarization using the MapReduce framework (MR-SGMM). In MR-SGMM, we partition the input data, cluster the data within each partition with DBSCAN, a density-based clustering algorithm, and discover the SGMM core points and their features for every cluster. We test the implementation on synthetic and real datasets to demonstrate its validity and efficiency. This paves the way for a scalable implementation of Summarization using Gaussian Mixture Model (SGMM).
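
The pipeline outlined in the abstract (partition, cluster each partition with DBSCAN, summarize each cluster) can be illustrated with a minimal single-process sketch. This is not the authors' implementation. The abstract does not say how SGMM core points are selected, so the sketch substitutes a plain per-cluster Gaussian summary (point count, mean, per-dimension variance). The function names, the strip partitioning along the first coordinate, and the parameters width, eps and min_pts are all hypothetical, and clusters that cross a partition boundary are not merged.

```python
import math
from collections import defaultdict


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def summarize(points):
    """Assumed stand-in for SGMM features: count, mean, per-dimension variance."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n for d in range(dim)]
    return {"weight": n, "mean": mean, "var": var}


def partition(points, width):
    """Hypothetical partitioner: vertical strips of the given width."""
    parts = defaultdict(list)
    for p in points:
        parts[int(p[0] // width)].append(p)
    return parts


def map_phase(part_id, points, eps, min_pts):
    """Map step: cluster one partition and emit (partition id, summary) pairs."""
    clusters = defaultdict(list)
    for p, label in zip(points, dbscan(points, eps, min_pts)):
        if label != -1:
            clusters[label].append(p)
    return [(part_id, summarize(c)) for c in clusters.values()]


def mr_summarize(points, width, eps, min_pts):
    """Run the map step per partition, then collect all summaries (reduce)."""
    parts = partition(points, width)
    mapped = [map_phase(k, v, eps, min_pts) for k, v in sorted(parts.items())]
    return [summary for records in mapped for _, summary in records]
```

In an actual MapReduce deployment each call to map_phase would run on a separate worker, and the reduce step would likely also need to reconcile clusters split across partitions, which this sketch leaves out.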

Acknowledgement

This research was partially supported by Mitacs.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Calgary, 2500 University Dr. NW, Calgary, AB, Canada
Arina Esmaeilpour & Behrouz H. Far
Knowledge Discovery and Data Mining Lab, Telfer School of Management, University of Ottawa, 55 Laurier Ave. E, Ottawa, ON, Canada
Arina Esmaeilpour, Elnaz Bigdeli, Fatemeh Cheraghchi & Bijan Raahemi
Computer Science Department, University of Ottawa, 600 King Edward, Ottawa, ON, Canada
Elnaz Bigdeli & Fatemeh Cheraghchi

Corresponding author

Correspondence to Arina Esmaeilpour.

Editor information

Editors and Affiliations

Lakehead University, Thunder Bay, Ontario, Canada
Richard Khoury
National Research Council Canada, Ottawa, Canada
Christopher Drummond

Cite this paper

Esmaeilpour, A., Bigdeli, E., Cheraghchi, F., Raahemi, B., Far, B.H. (2016). Distributed Gaussian Mixture Model Summarization Using the MapReduce Framework. In: Khoury, R., Drummond, C. (eds) Advances in Artificial Intelligence. Canadian AI 2016. Lecture Notes in Computer Science, vol 9673. Springer, Cham. https://doi.org/10.1007/978-3-319-34111-8_39

DOI: https://doi.org/10.1007/978-3-319-34111-8_39
Published: 13 May 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-34110-1
Online ISBN: 978-3-319-34111-8
eBook Packages: Computer Science, Computer Science (R0)