research-article

Connected Components for Scaling Partial-order Blocking to Billion Entities

Published: 19 March 2024

Abstract

In entity resolution, blocking pre-partitions data for further processing by more expensive methods. Two entity mentions are in the same block if they share identical or related blocking keys. Previous work has sometimes related blocking keys by grouping or alphabetically sorting them, but, as was shown for author disambiguation, the respective equivalences or total orders are not necessarily well suited to model the logical matching relation between blocking keys. To address this, we present a novel blocking approach that exploits the subset partial order over entity representations to build a matching-based bipartite graph, using connected components as blocks. To prevent over- and underconnectedness, we allow specification of overly general and generalization of overly specific representations. To build the bipartite graph, we contribute a new parallelized algorithm with a configurable time/space tradeoff for minimal element search in the subset partial order. As a job-based approach, it combines dynamic scalability and easier integration, making it more convenient than the previously described approaches. Experiments on large gold standards for publication records, author mentions, and affiliation strings suggest that our approach is competitive in performance and allows better addressing of domain-specific problems. For duplicate detection and author disambiguation, our method offers the expected performance as defined by the vector-similarity baseline used in another work on the same dataset and the common surname, first-initial baseline. For top-level institution resolution, we have reproduced the challenges described in prior work, strengthening the conclusion that for affiliation data, overlapping blocks under minimal elements are more suitable than connected components.
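The core idea of the abstract can be illustrated with a small sketch. It is not the parallelized algorithm contributed by the article; it is a naive, quadratic illustration under stated assumptions: each entity representation is modeled as a `frozenset` of feature strings, the bipartite graph is assumed to link every representation to the minimal elements that are subsets of it, and the function names `minimal_elements` and `connected_component_blocks` are hypothetical.

```python
from collections import defaultdict


def minimal_elements(reps):
    """Return the representations that have no proper subset in `reps`.

    Naive quadratic search, for illustration only.
    """
    unique = set(reps)
    return {r for r in unique if not any(s < r for s in unique)}


def connected_component_blocks(reps):
    """Group representations into blocks via connected components.

    Assumed construction: an edge joins a representation r and a minimal
    element m whenever m is a subset of r. Components are found with a
    simple union-find structure.
    """
    unique = set(reps)
    minimal = minimal_elements(unique)
    parent = {r: r for r in unique}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for r in unique:
        for m in minimal:
            if m <= r:  # m is a subset of r (or r itself)
                union(r, m)

    blocks = defaultdict(set)
    for r in unique:
        blocks[find(r)].add(r)
    return list(blocks.values())


# Hypothetical author-name representations as feature sets.
smith_j = frozenset({"surname=smith", "init=j"})
smith_john = frozenset({"surname=smith", "init=j", "first=john"})
smith_jane = frozenset({"surname=smith", "init=j", "first=jane"})
jones_a = frozenset({"surname=jones", "init=a"})
jones_anna = frozenset({"surname=jones", "init=a", "first=anna"})

mentions = [smith_j, smith_john, smith_jane, jones_a, jones_anna]
blocks = connected_component_blocks(mentions)
```

In this toy input, `smith_j` and `jones_a` are the minimal elements, and the two resulting blocks are the Smith and the Jones representations. A single representation that contains both minimal elements would merge the two blocks into one component, which is the kind of overconnectedness the abstract refers to.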

Published in

Journal of Data and Information Quality, Volume 16, Issue 1
March 2024
187 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/3613486
Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 19 March 2024
• Online AM: 20 February 2024
• Accepted: 27 December 2023
• Revised: 3 December 2023
• Received: 9 August 2023