Similarity Grouping in Big Data Systems

Silva, Yasin N.; Sandoval, Manuel; Prado, Diana; Wallace, Xavier; Rong, Chuitian

doi:10.1007/978-3-030-32047-8_19


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11807)

Included in the following conference series:

International Conference on Similarity Search and Applications


Abstract

Distributed computing technologies have opened the door for a wide range of organizations to analyze massive amounts of data. Grouping (fast but based on exact semantics) and clustering (relatively slow but based on similarity-aware semantics) are among the most useful data analysis operations. Previous work introduced the Similarity Grouping (SG) operator, which aims to integrate the best features of grouping and clustering, i.e., fast execution times and similarity-aware grouping semantics. The SG operators, however, were proposed for single-node relational database systems. This paper introduces the Distributed Similarity Grouping (DSG) operator, a highly parallel operator for identifying similarity groups in big datasets. DSG enables the identification of groups in which all the elements are within a given distance threshold of each other. This paper presents DSG's design details, implementation guidelines for Spark and Hadoop (two important Big Data systems), and an extensive performance and scalability evaluation.
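To make the grouping semantics concrete, the following is a minimal single-node sketch of groups in which all elements are within a given threshold of each other. It is not the DSG algorithm described in the paper: the greedy assignment strategy, the Euclidean distance, and the names `greedy_similarity_groups`, `is_valid_grouping`, and `eps` are assumptions introduced here for illustration only.

```python
from math import dist
from typing import List, Sequence

Point = Sequence[float]


def greedy_similarity_groups(points: List[Point], eps: float) -> List[List[Point]]:
    """Assign each point to the first group whose members are ALL within eps of it.

    Illustrative only: a naive greedy pass, not the distributed operator
    from the paper. The result depends on the input order.
    """
    groups: List[List[Point]] = []
    for p in points:
        for group in groups:
            # A point may join a group only if it is close to every member.
            if all(dist(p, q) <= eps for q in group):
                group.append(p)
                break
        else:
            # No existing group accepts the point, so it starts a new one.
            groups.append([p])
    return groups


def is_valid_grouping(groups: List[List[Point]], eps: float) -> bool:
    """Check the pairwise property: every pair inside a group is within eps."""
    return all(dist(a, b) <= eps for g in groups for a in g for b in g)


if __name__ == "__main__":
    sample = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5), (5.0, 5.0), (5.5, 5.0), (10.0, 10.0)]
    for group in greedy_similarity_groups(sample, eps=1.5):
        print(group)
```

Unlike exact grouping, which would place each distinct point in its own group, this sketch merges nearby points; unlike typical clustering, it enforces the threshold between every pair of group members rather than to a centroid or a chain of neighbors.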




Author information

Authors and Affiliations

Arizona State University, Glendale, USA
Yasin N. Silva, Manuel Sandoval, Diana Prado & Xavier Wallace
Tianjin Polytechnic University, Tianjin, China
Chuitian Rong


Corresponding author

Correspondence to Yasin N. Silva.

Editor information

Editors and Affiliations

ISTI-CNR, Pisa, Italy
Giuseppe Amato
ISTI-CNR, Pisa, Italy
Claudio Gennaro
New Jersey Institute of Technology, Newark, NJ, USA
Vincent Oria
University of Novi Sad, Novi Sad, Serbia
Miloš Radovanović


About this paper

Cite this paper

Silva, Y.N., Sandoval, M., Prado, D., Wallace, X., Rong, C. (2019). Similarity Grouping in Big Data Systems. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds) Similarity Search and Applications. SISAP 2019. Lecture Notes in Computer Science, vol 11807. Springer, Cham. https://doi.org/10.1007/978-3-030-32047-8_19


DOI: https://doi.org/10.1007/978-3-030-32047-8_19
Published: 23 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32046-1
Online ISBN: 978-3-030-32047-8
eBook Packages: Computer Science, Computer Science (R0)
