Abstract
Distributed computing technologies have opened the door for a wide range of organizations to analyze massive amounts of data. Grouping (fast but based on exact semantics) and clustering (relatively slow but based on similarity-aware semantics) are among the most useful data analysis operations. Previous work introduced the Similarity Grouping (SG) operator, which aims to integrate the best features of grouping and clustering, i.e., fast execution times and similarity-aware grouping semantics. The SG operators, however, were proposed for single-node relational database systems. This paper introduces the Distributed Similarity Grouping (DSG) operator, a highly parallel operator for identifying similarity groups in big datasets. DSG enables the identification of groups where all the elements are within a given threshold of each other. This paper presents DSG’s design details, implementation guidelines on Spark and Hadoop (two important Big Data systems), and an extensive performance and scalability evaluation.
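The grouping semantics stated in the abstract (every pair of elements in a group lies within a given threshold) can be illustrated with a small single-node sketch. This is not the DSG algorithm, whose design is not described here. It is a naive greedy pass, and the function names, the Euclidean distance, and the sample points are assumptions chosen only for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def is_similarity_group(points, eps):
    """Return True if every pair of points is within eps of each other."""
    return all(
        dist(p, q) <= eps
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )


def greedy_similarity_groups(points, eps):
    """Hypothetical naive grouping, NOT the paper's DSG operator.

    Each point joins the first existing group whose members are all
    within eps of it; otherwise it starts a new group.
    """
    groups = []
    for p in points:
        for g in groups:
            if all(dist(p, q) <= eps for q in g):
                g.append(p)
                break
        else:
            # No existing group accepts p, so open a new one.
            groups.append([p])
    return groups


# Illustrative data (assumed, not from the paper).
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0)]
groups = greedy_similarity_groups(points, eps=0.6)
print(groups)
```

With these sample points, (0.5, 0.0) is within the threshold of both (0.0, 0.0) and (1.0, 0.0), but those two are 1.0 apart, so the three cannot share a group. This is what separates the all-pairs condition from chain-style (connected-component) grouping. The result of the greedy pass depends on input order.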
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Silva, Y.N., Sandoval, M., Prado, D., Wallace, X., Rong, C. (2019). Similarity Grouping in Big Data Systems. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds) Similarity Search and Applications. SISAP 2019. Lecture Notes in Computer Science, vol 11807. Springer, Cham. https://doi.org/10.1007/978-3-030-32047-8_19
DOI: https://doi.org/10.1007/978-3-030-32047-8_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32046-1
Online ISBN: 978-3-030-32047-8
eBook Packages: Computer Science, Computer Science (R0)