Exploring Spark-SQL-Based Entity Resolution Using the Persistence Capability

Chen, Xiao; Zoun, Roman; Schallehn, Eike; Mantha, Sravani; Rapuru, Kirity; Saake, Gunter

doi:10.1007/978-3-319-99987-6_1

Conference paper
First Online: 31 August 2018

Part of the book series: Communications in Computer and Information Science (CCIS, volume 928)

Abstract

Entity Resolution (ER) is the task of identifying records that refer to the same real-world entity. A naive way to solve ER tasks is to compute the similarity of every pair in the Cartesian product of all records; this is called pair-wise ER and leads to quadratic time complexity. Faced with exploding data volumes, pair-wise ER struggles to achieve high efficiency and scalability. To tackle this challenge, parallel computing has been proposed to speed up the ER process. Because distributed programming is difficult, big data processing frameworks are often used to ease the realization of parallel ER, as they support data partitioning, workload balancing, and fault tolerance. However, the efficiency and scalability of parallel ER are also influenced by the adopted framework. In the area of parallel ER, the adoption of Apache Spark, a general framework supporting in-memory computation, is still not widely studied. Furthermore, although Apache Spark provides both low-level (RDD-based) and high-level (Dataset-based) APIs, to date only the RDD-based APIs have been adopted in parallel ER research. In this paper, we implement a Spark-SQL-based ER process and explore its persistence capability to assess the performance benefits. We evaluate its speedup and compare its efficiency to that of Spark-RDD-based ER. We observe that different persistence options have a large impact on the efficiency of Spark-SQL-based ER, so the option must be chosen carefully. With the best persistence option, the efficiency of our Spark-SQL-based ER implementation is improved by up to 3 times on different datasets, compared with a baseline that uses no persistence option or a misconfigured one.
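
To make the quadratic cost of pair-wise ER concrete, the following is a minimal single-machine sketch in plain Python. It is not the Spark-SQL implementation evaluated in the paper. The function names, the sample records, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration. The sketch compares each unordered pair once, which gives n(n-1)/2 comparisons rather than the full Cartesian product, but the growth is still quadratic.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pairwise_er(records, threshold=0.85):
    """Compare every unordered pair of records.

    Returns the index pairs whose similarity reaches the (assumed)
    threshold, together with the number of comparisons performed.
    """
    matches = []
    comparisons = 0
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        comparisons += 1
        if similarity(a, b) >= threshold:
            matches.append((i, j))
    return matches, comparisons


# Hypothetical sample data, not taken from the paper's datasets.
records = ["John Smith", "Jon Smith", "Mary Jones", "john smith"]
matches, comparisons = pairwise_er(records)
print(comparisons, matches)
```

Doubling the number of records roughly quadruples the number of comparisons in this sketch, which is the scalability problem that motivates parallel ER.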

Acknowledgments

The authors would like to thank the China Scholarship Council [No. 201408080093] for funding our work. We are also very grateful to Gabriel Campero Durand, David Broneske and Yusra Shakeel for providing valuable feedback.

Author information

Authors and Affiliations

Otto-von-Guericke-University of Magdeburg, Magdeburg, Germany
Xiao Chen, Roman Zoun, Eike Schallehn, Kirity Rapuru & Gunter Saake
German Research Center For Artificial Intelligence, Berlin, Germany
Sravani Mantha

Corresponding author

Correspondence to Xiao Chen.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Dariusz Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Paweł Kasprowski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Bożena Małysiak-Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Daniel Kostrzewa

Cite this paper

Chen, X., Zoun, R., Schallehn, E., Mantha, S., Rapuru, K., Saake, G. (2018). Exploring Spark-SQL-Based Entity Resolution Using the Persistence Capability. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Facing the Challenges of Data Proliferation and Growing Variety. BDAS 2018. Communications in Computer and Information Science, vol 928. Springer, Cham. https://doi.org/10.1007/978-3-319-99987-6_1

DOI: https://doi.org/10.1007/978-3-319-99987-6_1
Published: 31 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99986-9
Online ISBN: 978-3-319-99987-6
eBook Packages: Computer Science, Computer Science (R0)
