Integrating Real-Time Entity Resolution with Top-N Join Query Processing

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12817)

Abstract

Real-time entity resolution (ER) is a challenging problem for large datasets. Traditional top-N join query processing techniques assume clean data and perform no ER. On dirty datasets, where duplicate tuples refer to the same real-world entity, these techniques may return duplicates among the top-N tuples for a query; as a result, some useful tuples may not be retrieved, which reduces effectiveness. In this paper, based on “sorted and/or random accesses” and “no wild guesses”, we discuss models that integrate real-time ER with top-N join queries over dirty datasets of real vectors. We consider finite-dimensional \(\ell_{p}\) spaces with p-norm distances as nonmonotone ranking functions. Using the norm equivalence theorem from functional analysis as a foundation, and designing buffers that join tuples with an outer-join mechanism and cluster candidates for ER, we propose two database-friendly algorithms that answer top-N join queries in two data access settings: restricting sorted access and no random access. Extensive experiments measure the effectiveness and efficiency of our approaches over various dirty datasets.
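
The duplicate problem described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it ignores the join, the buffers, and the sorted/random access model, and it uses a simple distance threshold (the hypothetical parameter `eps`) as a stand-in for ER. The function names and the toy data are assumptions made for illustration only.

```python
import math

def p_norm_distance(x, q, p):
    """Distance between vectors x and q under the p-norm (p >= 1, or math.inf)."""
    diffs = [abs(a - b) for a, b in zip(x, q)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def naive_top_n(tuples, query, n, p=2):
    """Top-N by distance with no ER: duplicates can occupy result slots."""
    return sorted(tuples, key=lambda t: p_norm_distance(t[1], query, p))[:n]

def top_n_with_er(tuples, query, n, p=2, eps=0.05):
    """Top-N that skips any tuple within eps of an already selected tuple.

    The eps threshold is an assumed, simplistic duplicate test.
    """
    result = []
    for t in sorted(tuples, key=lambda t: p_norm_distance(t[1], query, p)):
        if all(p_norm_distance(t[1], r[1], p) > eps for r in result):
            result.append(t)
        if len(result) == n:
            break
    return result

# Hypothetical dirty data: "a1" and "a2" are near-identical records of one entity.
data = [
    ("a1", (1.00, 1.00)),
    ("a2", (1.01, 1.00)),
    ("b", (2.0, 2.0)),
    ("c", (5.0, 5.0)),
]
query = (1.0, 1.0)

naive_ids = [t[0] for t in naive_top_n(data, query, 2)]
er_ids = [t[0] for t in top_n_with_er(data, query, 2)]
print(naive_ids)  # the duplicate pair fills both slots
print(er_ids)     # the duplicate is skipped, so "b" is retrieved
```

In this toy case the naive top-2 returns both records of the same entity, while the threshold-based variant returns two distinct entities. Because all norms on a finite-dimensional space are equivalent, a duplicate test stated under one p-norm bounds the distance under any other up to a constant factor; in two dimensions, for example, the ∞-norm distance is at most the 2-norm distance, which is at most the 1-norm distance, which is at most twice the ∞-norm distance.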

Author information

Corresponding author

Correspondence to Liang Zhu.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhu, L., Li, X., Wei, Y., Ma, Q., Meng, W. (2021). Integrating Real-Time Entity Resolution with Top-N Join Query Processing. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science, vol 12817. Springer, Cham. https://doi.org/10.1007/978-3-030-82153-1_10

  • DOI: https://doi.org/10.1007/978-3-030-82153-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-82152-4

  • Online ISBN: 978-3-030-82153-1

  • eBook Packages: Computer Science, Computer Science (R0)
