Abstract
Real-time entity resolution (ER) is a challenging problem for large datasets. Traditional top-N join query processing techniques assume clean data and perform no ER. On dirty datasets, where duplicate tuples refer to the same real-world entity, these techniques may return duplicates among the top-N tuples of a query; as a result, some useful tuples fail to be retrieved, which lowers effectiveness. Building on the notions of “sorted and/or random accesses” and “no wild guesses”, this paper discusses models that integrate real-time ER with top-N join queries over dirty datasets of real vectors. For finite-dimensional \(\ell_{p}\) spaces with p-norm distances as nonmonotone ranking functions, we use the norm equivalence theorem from functional analysis as a foundation and design buffers that join tuples through an outer-join mechanism and cluster candidates for ER. On this basis, we propose two database-friendly algorithms that answer top-N join queries under two data-access settings: restricted sorted access and no random access. Extensive experiments measure the effectiveness and efficiency of our approaches over various dirty datasets.
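To make the setting concrete, the following minimal Python sketch illustrates two ideas from the abstract; it is not the paper's algorithms. It checks the standard norm-equivalence bounds \(\|x\|_{\infty} \le \|x\|_{p} \le n^{1/p}\|x\|_{\infty}\) on \(\mathbb{R}^{n}\), and it contrasts a top-N query that ignores duplicates with one that skips near-duplicates. The function names, the threshold `eps`, and the rule "two tuples within `eps` in max-norm are the same entity" are all assumptions made for illustration; a real ER step would be more involved.

```python
import heapq


def p_norm_distance(x, q, p):
    """Distance between vectors x and q induced by the p-norm (p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(x, q)) ** (1.0 / p)


def max_norm_distance(x, q):
    """Distance between vectors x and q induced by the max (infinity) norm."""
    return max(abs(a - b) for a, b in zip(x, q))


def naive_top_n(tuples, q, n, p=2):
    """Top-N tuples closest to q, with no duplicate handling (clean-data assumption)."""
    return heapq.nsmallest(n, tuples, key=lambda t: p_norm_distance(t, q, p))


def top_n_with_er(tuples, q, n, p=2, eps=0.05):
    """Top-N tuples closest to q, skipping near-duplicates of tuples already chosen.

    Hypothetical ER rule (an assumption, not from the paper): a candidate
    within eps (max-norm) of a selected tuple refers to the same entity.
    """
    result = []
    for t in sorted(tuples, key=lambda t: p_norm_distance(t, q, p)):
        if all(max_norm_distance(t, r) > eps for r in result):
            result.append(t)
        if len(result) == n:
            break
    return result


# Toy dirty dataset: (1.01, 1.0) is treated as a duplicate of (1.0, 1.0).
data = [(1.0, 1.0), (1.01, 1.0), (2.0, 2.0), (3.0, 1.0)]
query = (0.0, 0.0)

print(naive_top_n(data, query, 2))    # both copies of the same entity
print(top_n_with_er(data, query, 2))  # one copy, plus the next distinct tuple

# Norm equivalence in R^n: max-norm <= p-norm <= n**(1/p) * max-norm.
p, n_dim = 3, 2
for t in data:
    lo = max_norm_distance(t, query)
    mid = p_norm_distance(t, query, p)
    hi = n_dim ** (1.0 / p) * lo
    assert lo <= mid + 1e-12 and mid <= hi + 1e-12
```

The bounds matter because they let a search region defined by a p-norm ball be enclosed in, or approximated by, an axis-aligned box, which is the kind of range condition a relational database can evaluate directly.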
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, L., Li, X., Wei, Y., Ma, Q., Meng, W. (2021). Integrating Real-Time Entity Resolution with Top-N Join Query Processing. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science(), vol 12817. Springer, Cham. https://doi.org/10.1007/978-3-030-82153-1_10
DOI: https://doi.org/10.1007/978-3-030-82153-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82152-4
Online ISBN: 978-3-030-82153-1
eBook Packages: Computer Science, Computer Science (R0)