Abstract
Privacy-preserving record linkage is the problem of matching records held by two or more data holders without revealing any personal identifiers, thus maintaining the privacy of the individuals these records describe. Parallel processing has already been deployed in privacy-preserving record linkage to handle big data. In this paper, we further explore parallel methods based on Apache Spark and phonetic codes and propose improvements that achieve superior performance in terms of time efficiency and privacy characteristics. To support our claims, we provide extensive experimental results and a rigorous discussion.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Karakasidis, A., Koloniari, G. (2023). More Sparking Soundex-Based Privacy-Preserving Record Linkage. In: Foschini, L., Kontogiannis, S. (eds) Algorithmic Aspects of Cloud Computing. ALGOCLOUD 2022. Lecture Notes in Computer Science, vol 13799. Springer, Cham. https://doi.org/10.1007/978-3-031-33437-5_5
DOI: https://doi.org/10.1007/978-3-031-33437-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33436-8
Online ISBN: 978-3-031-33437-5
eBook Packages: Computer Science (R0)